- 


May 2010 + 
The Smart Guide to Building World -Class Applications _ 


A PENTON PUBLICATION 


r 
anugions fo 
$ storage » | 


í Server Developers 


Eduardo 
« Marketing Analyst 
/ tb ma 


C2 
Ñ — 
Julian | y 
Database Admin \ 
aa AN $ 


el 


ER Gifice SharePoint ag SOL Server 2008 


Server 2007 


——— 


Intelligent people. 
Intelligen CISIONS. 


Fiicient business. 


What happens when you combine familiar Microsoft® Office tools with the robust analysis and reporting of SQL Server®2008 and 
the information-access and sharing capabilities of SharePoint®? You get people analyzing data from multiple sources and building 
reports on their own. Real-time, informed decision-making without the intervention of IT? Pretty good math, by any standards. 


To learn more about how better decision-making can create efficiencies, go to itseverybodysbusiness.com/decision 


ue Snap this tag to learn more 
4 about better decision-making 
m or text DECISION to 21710 
x Get the free app for your phone at 3 A ; 
Ae http://gettag.mobi Because it's everybody's Jl business 


; 


7 
- 


O. 
þ= 
Nn. 
& 
Habel 
> 
fe] 
v 


w 
af 


«Bins spor. 


WWW.SQLMAG.COM 
MAY 2010 Vol eR 


x Database Professionals 


a å aos 2010’s new code-editing features? 
port forgthe 
falsity 


~ database projects, and su S uP 
» NETFramework 4.0 and 


* Framework 4.0 lotou more easily, 


design and develop applications 


FEATURES 


1 5 Optimizing TOP N Per Group 
Queries 
—Itzik Ben-Gan 
Compare solutions to TOP N Per Group queries, based 
on the ROW_NUMBER function, the APPLY opera- 
tor, and grouping and aggregation of concatenated 
elements. 


21 Efficient Data Management 
in SQL Server 2008, Part 2 


—Don Kiely 

Sparse columns can give you significant storage sav- 
ings, and you can use column sets to efficiently update 
or retrieve sparse column data. Filtered indexes work 
with sparse columns and nonsparse columns to provide 
improved query performance. 


27 Is Your Physical Disk I/O 
Affecting SQL Server 
Performance? 


—Andrew J. Kelly 

Get a handle on the fundamentals of acceptable physi- 
cal I/O for SQL Server and learn how to determine 
whether it’s affecting your system’s performance. 


29 How to Create PowerPivot 


39 


Applications in Excel 2010 
—Derek Comingore 

With PowerPivot, Microsoft is closer to its goal of 
bringing business intelligence (BI) to the masses. Learn 
what PowerPivot is and how to use its client-side 
components to create a self-service BI application. 


improving the Performance 
of Distributed Queries 

—Will Alber 

Writing a distributed query might appear to be easy 
but there are many potential pitfalls. Here are some 
strategies for avoiding problems when you're writing 
them and tweaking them afterward. 


Editor’s Tip 


Be sure to stop by the SQL 


Server Magazine booth at 

TechEd June 7—10 in New 

Orleans. We'd love to hear your feedback! 
—Megan Keller, associate editor 
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Altova DatabaseSpy 2010 


—Michael K. Campbell 

Altova DatabaseSpy provides a set of standardized features 
for interacting with a wide range of database platforms at the 
database level. 


Industry News: 

Bytes from the Blog 

Microsoft has revealed that it plans to release an SP4 for 
SQL Server 2005. Find out when and where to go from here. 


New Products 

Check out the latest products from Confio Software, 
DCF Technologies, DB Software Laboratory, and 
Nob Hill Software. 
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MapForce® — graphical data mapping, transformation 
& conversion tool 


XMLSpy® — industry-leading XML editor with strong 
database integration 


StyleVision® — visual stylesheet and electronic forms 
design tool for publishing DB data 


Download a 30 day free trial! 


Try before you buy with a free, 
fully functional, 30-day trial 
from www.altova.com. 


The Missing Link in 


Self-Service BI 


eR business intelligence (BI) is a great 
thing—it brings the power of analysis to knowl- 
edge workers and decision makers who can make use 
of it directly. However, after hearing Microsoft’s push 
on self-service BI a few times and comparing that 
vision to the reality that most businesses aren’t heav- 
ily invested yet in BI, I’m sensing a disconnect. Pm 
concerned that there’s an implication that you can 
just get these new products and your BI is good to 
go. For example, there has been a lot of emphasis on 
the SQL Server 2008 R2 PowerPivot for Excel front- 
end tool (which is truly pretty cool). However, there’s 
more than a little hand waving over the ugly infra- 
structure part in the back where all the BI data comes 
from. It would be a bit too easy for an IT manager 
or C-title to hear about self-service BI or see a demo 
and think that by getting PowerPivot they’re going to 
get instant BI. Experienced DBAs and BI profession- 
als know it isn’t really that simple, but management 
could easily overlook the behind-the-scenes work in 
their enthusiasm to bring BI into their organizations. 


Building the Back-End Infrastructure 

Building a BI infrastructure requires careful plan- 
ning and design. The SQL Server 2008 R2 and SQL 
Server 2008 Enterprise and Datacenter Editions 
provide the BI subsystems required to build this in- 
frastructure, but someone still has to design the data 
warehouse and its SQL Server Analysis Services 
(SSAS) databases. This requires an understanding of 
OLAP structures, such as cube design, as well as an 
awareness of the business and the types of informa- 
tion that knowledge workers will need. SSAS and the 
BI Development Studio (BIDS) provide the tools, 
but the knowledge of what to build must come from 
your own organization. The information required by 
the business will guide your decisions about how to 
build the data warehouse. After the data warehouse 
is built, you need to design and implement an extrac- 
tion, transformation, and loading (ETL) process 
that will load your data warehouse with OLTP data. 
SQL Server Integration Services (SSIS) provides this 
capability. But there’s more: Remember that to im- 
plement self-service BI you'll need an infrastructure 
upgrade in addition to desktop updates for the work- 
ers who will make use of it. On the infrastructure 
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side you'll need SQL Server 2008 R2 to get Power- 
Pivot. Plus, you'll need SharePoint 2010 so that you 
can archive and manage the PowerPivot workbooks. 
On the client side you'll need Office 2010. For more 
about self-service BI, see “Donald Farmer Discusses 
the Benefits of Managed Self-Service BI,” October 
2009, InstantDoc ID 102613. 


The Devil Is in the Details 

The job of a Microsoft product manager is to make 
new features in the product look easy and appealing. 
However, at the same time Microsoft tends to down- 
play the fact that most companies aren’t early adopt- 
ers of new technologies. Plus, Microsoft likes to cre- 
ate an image of a world in which customers migrate 
and quickly master their new technologies. While 
some leading-edge customers fit this image perfectly, 
most businesses aren’t there yet. Most companies 
prefer to be latecomers to technology. They (wisely) 
adopt new technologies after most of the early issues 
have been worked out. For instance, our SOL Mag 
Instant Polls and your letters and feedback show 
that most companies are on SQL Server 2005—not 
SQL Server 2008 and certainly not SQL Server 2008 
R2. Microsoft tends to smooth over the fact that 
the adoption of new technologies like PowerPivot 
would take quite a bit of work for most companies, 
and Microsoft can downplay discussing the devil in 
the details, which is the background work required 
to make the demo magic happen. 


PowerPivot Is Cool, So Prep for It 
Don’t get me wrong. I think PowerPivot and the 
concept of managed self-service BI are both ultra 
cool. They bring the power of data analysis into 
the hands of the business information workers, 
and they let IT manage and protect resources, 
such as Excel spreadsheets, better than ever before. 
However, don’t think this is going to automati- 
cally happen after a new product install. To really 
get the benefits of these tools you need to acquire 
BI expertise and have a BI infrastructure in place. 
Tools alone won’t make that happen. The missing 
link in self-service BI is the back end your DBAs 
and BI professionals need to build. SQL 
InstantDoc ID 104669 
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Sizing Your Transaction Log 


*ve read a lot of conflicting advice on the Internet 

regarding how large a database’s transaction log 

should be, ranging from very small to half the size 
of the data. It seems that whichever size I pick, it’s 
wrong and the transaction log grows. How can I more 
accurately determine the correct size to use? 


This is acommon question I’m asked, and the simple 

answer is that there’s no right answer! However, 

there are plenty of wrong answers—including that 

a transaction log should be a certain percentage 

of the size of the data in a database. There’s no 

justification for such advice, and I urge you not to 
follow it. 

There are quite a few factors you need to con- 
sider when figuring out how large a transaction 
log should be, but the size of the transaction log is 
really a balancing act of how quickly log records 
are being generated against how quickly they’re 
consumed by all the various operations that might 
need them to be available. The trick is to determine 
what factors could affect your database and cause 
transaction log growth—no matter what size you 
initially make it. 

It’s relatively easy to figure out the volume of 
transaction log records being generated by an average 
daily workload. To avoid log growth, you need to 
make sure that nothing is causing the transaction 
log records to still be required by SQL Server. Log 
records that are still required can’t be discarded and 
their space reused. The following are some reasons 
the records could still be required: 

e The database is in the FULL or BULK_ 
LOGGED recovery model but transaction log 
backups aren’t being performed, so the transac- 
tion log records can’t be discarded until they have 
been backed up. 

e Database mirroring is enabled and there’s a queue 
of transaction log records that haven’t yet been 
sent to the mirror database. 

e Transactional replication is enabled and the Log 
Reader Agent job hasn’t processed all the trans- 
action log records for the database. 

e There’s a long-running transaction that’s pre- 
venting transaction log records from being 
discarded. 


The transaction log needs to be managed correctly 
to ensure that it doesn’t grow out of control, even if 
you've sized it appropriately. I discuss this in much 
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more depth in my blog post “Importance of proper 
transaction log size management” at www.sqlskills 
.com/BLOGS/PAUL/post/Importance-of-proper- 
transaction-log-size-management.aspx. 

You also need to consider unusual transactions 
that perform very large changes to the database and 
generate a lot of transaction log records—more than 
the regular day-to-day workload. The most common 
culprit here is an index rebuild operation, which is a 
single-transaction operation, so the transaction log 
will potentially grow to accommodate all the transac- 
tion log records it generates. 

Here’s an example scenario. Say the regular 
workload on the PaulsDB database generates 12GB 
of transaction log records every day, so I might 
choose to perform a transaction log backup at hourly 
intervals and assume that I can therefore safely set 
the transaction log size to be 0.5GB. However, I 
need to take into consideration whether that 12GB 
is generated in a uniform manner over 24 hours, 
or whether there are “hot-spots.” In my fictional 
example, I discover that from 9 a.m. to 10 A.M., 4GB 
of transaction log records are generated from a bulk 
load into PaulsDB, with the rest of the log being 
generated uniformly. In that case, my hourly log 
backup won’t be enough to contain the size of the 
transaction log at 0.5GB. I can choose to size the 
transaction log at 4GB to avoid autogrowth or take 
more frequent log backups and keep the transaction 
log smaller. 

In addition, I need to consider other unusual 
transactions. It turns out there’s a 7GB clustered 
index in PaulsDB that’s rebuilt once a week as 
part of regular index maintenance. The BULK_ 
LOGGED recovery model can’t be used to reduce 
transaction log generation, because there are user 
transactions occurring 24 x 7, and switching to 
BULK_LOGGED runs the risk of data loss if a 
disaster occurs while in that recovery model (because 
a tail-of-the-log backup wouldn’t be permitted). So 
the transaction log has to be able to accommodate 
the single-transaction 7GB index rebuild. I have 
no choice but to make the transaction log size for 
PaulsDB 7GB, or alter the regular index mainte- 
nance that’s performed. 

As you can see, it’s not a simple process to 
determine the size of a transaction log, but it’s not 
an intractable problem either, once you under- 
stand the various factors involved in making the 
decision. 
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QUESTIONS 
ANSWERED 


Determining the Position of Search 
Arguments Within Query Plans 


LISTING I: The Search Argument 
in a WHERE Clause 


SELECT so.OrderDate, c.AccountNumber, p.FirstName 
, p.LastName, sp.StateProvinceCode 
FROM Sales.SalesOrderHeader AS so 
INNER JOIN Sales.Customer AS c 
ON so.CustomerID = c.CustomerID 
INNER JOIN Person.Person AS p 
ON c.PersonID = p.BusinessEntityID 
INNER JOIN Person.Address AS pa 
ON so.ShipToAddressID = pa.AddressID 
INNER JOIN Person.StateProvince AS sp 
ON sp.StateProvinceID = pa.StateProvinceID 
WHERE sp.CountryRegionCode = 'US' 
ORDER BY sp.StateProvinceCode, so.OrderDate 


LISTING 2: The Search Argument 
in an INNER JOIN Clause 


SELECT so.OrderDate, c.AccountNumber, p.FirstName 
, p.-LastName, sp.StateProvinceCode 
FROM paleo SalesOrderHeader AS so 
INNER JOIN Sales.Customer AS c 
ON so.CustomerID = c.CustomerID 
INNER JOIN Person.Person AS p 
ON c.PersonID = p.BusinessEntityID 
INNER JOIN Person.Address AS pa 
ON so.ShipToAddressID = pa.AddressID 
INNER JOIN Person.StateProvince AS sp 
ON sp.StateProvinceID = pa. StateProvinceID 
AND sp.CountryRegionCode = 'U 
ORDER BY sp.StateProvinceCode, so. Gee 


LISTING 3: The Search Argument 
in a WHERE Clause When an 
OUTER JOIN Clause Is Used 


SELECT so.OrderDate, c.AccountNumber, p.FirstName 
, p.LastName, sp.StateProvinceCode 
FROM Sales.Customer AS c 
LEFT OUTER JOIN Sales.SalesOrderHeader AS so 
ON so.CustomerID = c.CustomerID 
LEFT OUTER JOIN Person.Person AS p 
ON c.PersonID = p.BusinessEntityID 
LEFT OUTER JOIN Person.Address AS pa 
ON so.ShipToAddressID = pa.AddressID 
LEFT OUTER JOIN Person.StateProvince AS sp 
ON sp.StateProvinceID = pa.StateProvinceID 
WHERE sp.CountryRegionCode = 'US' 
ORDER BY sp.StateProvinceCode, so.OrderDate 


LISTING 4: The Search Argument 
in an OUTER JOIN Clause 


SELECT so.OrderDate, c.AccountNumber, p.FirstName 
p.LastName, sp. StateProvinceCode 
FROM Sales.Customer AS c 
LEFT OUTER JOIN Sales.SalesOrderHeader AS so 
ON so.CustomerID = c.CustomerID 
LEFT OUTER JOIN Person.Person AS p 
ON c.PersonID = p.BusinessEntityID 
LEFT JOIN Person.Address AS pa 
ON so.ShipToAddressID = pa.AddressID 
LEFT JOIN Person.StateProvince AS sp 
ON sp.StateProvinceID = pa.StateProvinceID 
AND sp.CountryRegionCode = 'US' 
ORDER BY sp.StateProvinceCode, so.OrderDate 
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hen writing a join, I always wonder if I can improve per- 

formance by moving the search argument into the JOIN 

clause instead of placing it in the WHERE clause. Listing 1 
shows the search argument in the WHERE clause, and Listing 2 
shows it in a JOIN clause. What effect would this move have, and is 
it beneficial? 


With a WHERE clause, search arguments are applied to all the rows in the 
result set. In an INNER JOIN, the position of the search argument has no 
effect on the result set because the INNER JOIN requires all rows to have 
a match in all joined tables, and all rows have to meet the conditions of the 
WHERE clause argument. That’s an important point to make because if 
this query were an OUTER JOIN, the result set returned would be different 
based on where the search argument is placed. I’ll explain more about this 
later. 

The position of the argument shouldn't change the performance of the 
query plan. In an INNER JOIN, SQL Server will recognize the search argu- 
ment and apply it efficiently based on the plan it chooses. It’s likely that SQL 
Server will filter the data using the argument before the join, because doing so 
usually reduces the overall cost of the join. However, there are other factors, 
such as the indexes that exist, that could influence the query plan and result 
in different behavior, but again, these aren't affected by the position of the 
argument. 

As I mentioned above, OUTER JOINs are different—the position of 
the search argument really matters. When executing an OUTER JOIN, a 
search argument in the FROM clause will define the result set of the INNER 
JOIN before the OUTER JOIN results are added. If we wanted to see sales 
order information, including StateProvincelID, for all US customers but list 
all customers regardless of StateProvinceID, we could ask for an OUTER 
JOIN to describe customers and sales (which also requires joins to Person and 
PersonAddress to expose the StateProvinceID). If we use the base queries you 
mentioned in your question to do this, we'll end up with the queries shown 
in Listing 3 and Listing 4. When executed, you'll find that they have different 
query plans and drastically different results. Listing 3 finds all matching rows 
for customers and only outputs a row if the customer is a US customer. Using 
the AdventureWorks2008 database, there are a total of 12,041 customers where 
the CountryRegionCode is US. 

Listing 4 uses the CountryRegionCode as the predicate for deter- 
mining which rows should return the data for the columns in the select 
list, including sp. StateProvinceID. However, it still produces rows for 
customers that aren’t in the United States. This query returns 32,166 
rows. 

Your question really isn’t performance related. SQL Server can 
optimize your query regardless of the position of a search argument. 
However, I strongly suggest that all search arguments stay in the 
WHERE clause so that you never get caught with incorrect data if 
you decide to change a complex query from an INNER JOIN to an 
OUTER JOIN. ISQL| 
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ORE on the WEB 


See the web figures at 
InstantDoc ID 103679. 


s a developer, you use Visual Studio (VS) to 
A many flavors of applications for the 

.NET Framework. Typically, a new release 
of VS comes with a brand-new version of the .NET 
Framework, and VS 2010 is no exception—it ships 
with the .NET Framework 4.0. However, you can use 
VS 2010 to build applications for any .NET platform, 
including .NET 3.5 and .NET 2.0. 

VS 2010 also includes an improved set of 
design-time facilities such as IntelliSense, refactor- 
ing, code navigation, new designers for workflows, 
Entity Framework-based applications, and WPF 
applications. Let’s take a look at how these features 
help database developers. 


Multimonitor Support 

VS owes a large share of its popularity to its 
integrated development environment (IDE), which 
is made up of language- and feature-specific code 
editors, visual designers, IntelliSense, auto- 
completion, snippets, wizards, controls, and 
more. This IDE is extended in VS 2010 to 
offer the much-requested multimonitor sup- 
port. Writing code with the .NET Frame- 
work requires you to mix designer windows with 
code windows while keeping an eye on things such 
as a database profiler, an HTTP watcher, a specifi- 
cation document, or an entity-relationship model. 
However, doing so requires monitor real estate, and 
reducing the size of the fonts employed is no lon- 
ger an option. Having multiple monitors is a viable 
option because monitors aren’t very expensive and 
are easy to install. Many IT organizations are using 
dual monitors to increase productivity and save time 
and resources. 


Microsoft, in fact, has been supporting the use 
of multiple monitors at the OS level since Windows 
XP. The real pain was getting VS to detect multiple 
monitors and allow all of its windows, including 
code editors, designers, and various dialog boxes, to 
be dragged around outside the border of the IDE’s 
parent window. Web Figure 1 (www.sqlmag.com, 
InstantDoc ID 103679) shows the Float option, 
which enables full multimonitor support in VS 
2010. Once you select this option, you can move the 
window around the entire screen, and even move it to 
another screen. 


A New Code-Writing 
Experience 

Very few editing features in VS ever exceed develop- 
ers’ expectations. In fact, to get an optimal experi- 
ence, you probably want to use VS in conjunction 
with some third-party tools. However, what makes 
VS really great is the huge number of code-editing 
features it offers, and this number grows with each 
new release. VS still leaves room for third-party prod- 
ucts, which offer a lot of additional features, but it 
offers a good-enough coding experience out of the 
box. And the quality of your out-of-the-box cod- 
ing experience increases significantly with each new 
release. 

In VS 2010, IntelliSense includes auto-filtering, 
which gives you the ability to display a context- 
sensitive list of suggestions. In this version, the list 
isn’t limited to an alphabetical sequence or to all 
names starting with the typed sequence. Intelli- 
Sense attempts to guess the context in which you're 
operating and shows related suggestions, and it even 
understands abbreviations. For example, if you type 
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VISUAL STUDIO 2010 ° i OR 


Do more with the .NET Framework 4.0 


WL in an IntelliSense window, it will match member 
names, such as WriteLine. 

Refactoring is an aspect of development that 
has gained of lot of attention in the past few years. 
Refactoring is the process of rewriting the source 
code in a better way (i.e., adding testability, sepa- 
ration of concerns, extensibility) without altering 
the actual behavior. Originally associated with agile 
practices, refactoring is now a common, everyday 
practice for almost every developer. Because refac- 
toring doesn’t add any new behavior to the code, it’s 
often perceived to be a waste of time and is neglected. 
In the long run, however, a lack of systematic refac- 
toring leads to low-quality software or even project 
failures. VS 2010’s refactoring tools are an excellent 
way to speed up the refactoring process, making it 
affordable for nearly any development team. 

Previous versions of VS offer a Refactor menu, 
but VS 2010’s refactoring menu is richer than ever. 
As Figure 1 shows, it comes with more refactoring 
options than earlier versions, and it offers a more 
powerful Find tool to help you resolve missing types, 
namespaces, and assemblies. When you select an 
object in the code editor, it will instantly highlight 
all the references to that particular object, and it 
can also display the call hierarchy for an entity or a 
method. 

With VS 2010, you can seriously consider not 
using a third-party tool to help you with your coding 
chores. However, VS isn’t the only product to improve 
in the number and quality of editing and refactoring 
facilities. In fact, it remains inferior to existing ver- 
sions of commercial products. But if you’ve never 
used a third-party refactoring tool, you'll feel even 
less of a need for one in VS 2010. If you're accus- 
tomed to working with a refactoring tool, dropping it 
because of the new features in VS 2010 probably isn’t 
ideal—you might still want to use or upgrade your 
third-party tool. 


Database Development 
As Figure 2 shows, VS 2010 comes with several 
database-specific projects that incorporate some of the 
features previously available in VS Team System 2008 
Database Edition. You can create a database for SQL 
Server 2008 and SQL Server 2005, as well as NET 
assemblies (i.e., CLR database objects) to run inside 
of SQL Server. You use the familiar VS interface to 
add tables, foreign keys, and functions to the database. 
But is creating a database in VS beneficial? 

There are various benefits of creating a database 
project in VS. First, you'll have all the database objects 
defined in the same place and alongside the code (e.g., 
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a web or Windows application) that’s using the data- 
base. Your schema is captured by a regular project 
and can be added to the code repository and shared 
and versioned the same way as any other aspect of the 
project. Figure 3 shows a database project in action. 

In the Properties section, you'll find settings for 
SQL command variables and permissions, and you 
can add references to a VS database project. Finally, 
in the Scripts section, you'll find two project fold- 
ers named Pre-Deployment and Post-Deployment, 
which contain SQL scripts to run before and after the 
database deployment. 

It’s worth mentioning VS 2010’s data-comparison 
capabilities. This release lets you compare the data in 
a source database with the content stored on a target 
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database for a specific pair of tables. Figure 4 shows 
the New Data Comparison dialog box. After setting 
up a data comparison, you can review the data and 
decide whether you want to synchronize the tables and 
persist the changes. An alternative to updating the tar- 
get table is that you can export an UPDATE T-SQL 
script to a file and run it at a later time. 

There are many ways to take advantage of VS 
2010’s data-comparison capabilities. The most obvious 
usage is to update a development server with a copy of 
the tables you have in your production environment. 
Another common scenario is to use the tool to copy 
data across two or more tables in the same or different 
databases. Finally, you might consider using the Data 
Comparison wizard to compare data as it appears in 
a table before and after you run tests as a way to assert 
the behavior of a piece of code. In fact, it should be 
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noted that both refactoring and testing are project- 
wide features in VS that aren’t limited to C# or Visual 
Basic .NET projects. You can have a few refactoring 
features (i.e., Rename) and testing assertions available 
for database-related projects, too. 

In addition to comparing data in tables, you can 
also compare the schema of two databases. A schema 
comparison operation is aimed at finding structural 
differences between two databases. The detected delta 
is then summarized to a DDL script that you can run 
at a later time to make the databases equal. 

Finally, note that the database tools aren’t available 
in just any version of VS 2010. The features I discussed 
here are included in only the higher-end versions of 
VS 2010, such as Premium Edition and Ultimate Edi- 
tion. In particular, no database tools are available in 
VS 2010 Professional Edition and Express Edition. 
(For more information about VS 2010’s editions, see 
the web-exclusive sidebar “VS 2010 Flavors,” www 
.sqlmag.com, InstantDoc ID 103679.) 


LINQ-to-SQL Refined 

When LINQ-to-SQL was introduced in VS 2008, 
it was considered to be a dead end after only a few 
months on the market. However, LINQ-to-SQL is 
available in VS 2010, and it includes a long list of bug 
fixes and some minor adjustments, but you shouldn’t 
expect smart, creative enhancements. Primarily cre- 
ated for web developers and websites, LINQ-to-SQL 
is just right for small-to-midsized businesses (SMBs) 
that don’t always need the abstraction and power of 
a true object-relational mapping (O/RM) framework, 
such as the Entity Framework. Many companies pre- 
fer to use LINQ-to-SQL because it’s smaller, faster, 
and simpler than the Entity Framework, but it goes 
without saying that LINQ-to-SQL isn’t as powerful in 
terms of entity-relationship design database support, 
which is limited to SQL Server. 

Some changes to LINQ-to-SQL were made in VS 
2010. First, when a foreign key undergoes changes in 
the database schema, simply re-dragging the table- 
based entity into the designer will refresh the model. 
Second, the new version of LINQ-to-SQL will pro- 
duce T-SQL queries that are easier to cache. SQL 
Server makes extensive use of query plans to optimize 
the execution of queries. A query execution plan is 
reused only if the next query exactly matches the pre- 
vious one that was prepared earlier and cached. Before 
VS 2010 and the .NET Framework 4.0, the LINQ-to- 
SQL engine produced queries in which the length of 
variable type parameters, such as varchar, nvarchar, 
and text, wasn’t set. Subsequently, the SQL Server 
client set the length of those fields to the length of the 
actual content. As a result, very few query plans were 
actually reused, creating some performance concerns 
for DBAs. In the LINQ-to-SQL that comes with 
.NET 4.0 and VS 2010, text parameters are bound to 
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queries as nvarchar(4000), or as nvarchar(MAX) if 
the actual length is greater than 4000. The following 
is the resulting query that hits SQL Server when you 
target .NET 3.5 in VS 2010: 


exec sp_executesq] N'SELECT TOP (1) 
[tø] . [CustomerID] 

FROM [dbo].[Customers] AS [tø] 

WHERE [tø].[City] = @p@' ,N'@pd 
nvarchar(6)',@p@=N'London' 


And here’s the resulting query when you target the 
.NET Framework 4.0 in VS 2010: 


exec sp_executesq] N'SELECT TOP (1) 
[tø] . [CustomerID] 

FROM [dbo]. [Customers] AS [tø] 

WHERE [tø].[City] = @p@' ,N'@pd 
nvarchar (4Ø0øø) ' ,@pøð=N' London ' 


Also, VS 2010’ LINQ-to-SQL supports server- 
generated columns and multiple active result sets. 

LINQ-to-SQL still doesn’t support complex 
types or any mapping that’s more sophisticated 
than a property-to-column association. In addi- 
tion, it doesn’t support automatic updates of the 
model when the underlying schema changes. And 
although LINQ-to-SQL works only with SQL 
Server, it still doesn’t recognize the new specific 
data types introduced in SQL Server 2008, such as 
spatial data. 


The Entity Framework 4.0 
LINQ-to-SQL is the low-end framework for data- 
driven applications and, in a way, it’s the simplest 
object-oriented replacement for ADO.NET. ADO 
.NET is still an available framework, but it mostly 
serves as the underlying API for more abstract and 
object-oriented APIs such as LINQ-to-SQL and its 
big brother—the Entity Framework. Positioned as 
the comprehensive framework to fulfill the expecta- 
tions of enterprise architects and developers, the 
Entity Framework enables serious domain model 
design in the .NET space. 

In VS 2010, the Entity Framework offers two 
distinct programming models—data-first and 
model-first. With the data-first model, you pick up 
an existing database connection and ask the designer 
to build an object model based on that connection. 
The model-first programming model lets you create 
an abstract entity-relationship model within the 
designer and offers to convert the DDL to physical 
tables. 

This version of the Entity Framework includes 
several external VS 2010 extensions to customize the 
database generation and update process. In particular, 
I recommend checking out the Entity Designer 
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Database Generation Power Pack (visualstudiogallery 
-msdn.microsoft.com/en-us/df3541c3-d833-4b65- 
b942-989e7ec74c87), which lets you customize the 
workflow for generating the database from the model. 
The Database Generation Power Pack is a useful tool 
that meets the needs of developers who are willing to 
work at a higher level of abstraction with objects and 
object-oriented tools, and DBAs who are concerned 
about the stability, structure, and performance of the 
database. This new tool gives you control over the 
logic that maps an abstract model to a concrete list of 
tables, views, and indexes. The default wizard in VS 
2010 uses a fixed strategy—one table per entity. You 
can choose from six predefined workflows or create a 
custom workflow using either the T4 template-based 
language of VS or Windows Workflow. Web Figure 2 
shows the list of predefined workflows. 

The Entity Framework 4.0 introduces a significant 
improvement in the process that generates C# (or 
Visual Basic) code for the abstract model. The default 
generation engine produces a class hierarchy rooted 
in the Entity Framework library classes. In addition, 
the Entity Framework includes two more generation 
engines: the Plain Old CLR Object (POCO) genera- 
tor and the self-tracking entity generator. The first 
engine generates classes that are standalone with 
no dependencies outside the library itself. The self- 
tracking entity engine generates POCO classes that 
contain extra code so that each instance of the entity 
classes can track their own programmatic changes. 

When the first version of the Entity Framework 
came out in the fall of 2008, some prominent members 
of the industry signed a document to express their 
lack of confidence with the product. The new version 
of Entity Framework fixes all the points mentioned in 
that document, and it includes even more features. If 
you passed up the first version of the Entity Frame- 
work because you didn’t think it was adequate for 
your needs, you might want to give it a second chance 
with VS 2010. 


Future Enhancements 

VS 2010 contains so many features and frameworks 
that it’s hard to summarize the product in the space 
of an article. The new version offers new database 
development testing and editing features, as well as 
new UI tools around the Entity Framework. 

So what you can expect after the release of VS 
2010? VS is the central repository of development 
tools, frameworks, and libraries, and it’s pluggable 
and extensible. Therefore, you should expect Micro- 
soft to provide new designers and wizards to fix 
incomplete tools or to add cool new features that 
just weren’t ready in time for the VS 2010 launch. 
Developing for .NET is a continuous process in 
constant evolution. SQL 
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Optimizing 


Per Group Queries 


Compare solutions based on ROVV_NUMBER, APPLY, 


and concatenation 


querying task can have several solutions, 

and one solution won't perform best in all 

cases. Several factors can influence which 
solution performs best, including data density, avail- 
ability of indexes, and so on. To decide on a solution, 
you must know your data, understand how SQL 
Server’s optimizer works, and determine whether it’s 
worthwhile to add important indexes in terms of overall 
impact on the system. In this article I present a simple, 
common querying task; then I show three solutions, 
and I discuss when each solution would work best. 


Challenge 

The task that’s the focus of this article is commonly 
referred to as Top N Per Group. The idea is to return a 
top number of rows for each group of rows. Suppose 
you have a table called Orders with attributes called 
orderid, custid, empid, shipperid, orderdate, shipdate, 
and filler (representing other attributes). You also 
have related tables called Customers, Employees, and 
Shippers. You need to write a query that returns the 
N most recent orders per customer (or employee or 
shipper). “Most recent” is determined by order date 
descending, then by order ID descending. 

Writing a workable solution isn’t difficult; the chal- 
lenge is determining the optimal solution depending 
on variable factors and what's specific to your case in 
your system. Also, assuming you can determine the 
optimal index strategy to support your solution, you 
must consider whether you're allowed and if you can 
afford to create such an index. Sometimes one solu- 
tion works best if the optimal index to support it is 
available, but another solution works better without 
such an index in place. In our case, the density of the 
partitioning column (e.g., custid, if you need to return 
the top rows per customer) also plays an important 
role in determining which solution is optimal. 


Sample Data 
Run the code in Listing 1 to create a sample database 
called Test TOPN and within it a helper function called 
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GetNums. The helper function accepts a number as 
input and returns a sequence of integers in the range 
1 through the input value. The function is used by the 
code in Listing 2 to populate the different tables with 
a requested number of rows. Run the code in Listing 2 
to create the Customers, Employees, Shippers, and 
Orders tables and fill the tables with sample data. 
The code in Listing 2 populates the Customers table 
with 50,000 rows, the Employees table with 400 rows, 
the Shippers table with 10 rows, and the Orders table 
with 1,000,000 rows (approximately 240-byte row size, 
for a total of more than 200MB). To test your solu- 
tions with other data distribution, change the variables’ 
values. With the existing numbers, the custid attribute 
represents a partitioning column with low density 
(large number of distinct custid values, each appearing 
a small number of times in the Orders table), 
whereas the shipperid attribute represents a 
partitioning column with high density (small 
number of distinct shipperid values, each 
appearing a large number of times). 
Regardless of the solution you use, if you 
can afford to create optimal indexing, the best approach 
is to create an index on the partitioning column (e.g., 
custid, if you need top rows per customer), plus the 
ordering columns as the keys (orderdate DESC, orderid 
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LISTING I: Code to Create Sample Database 
Test TOPN and Helper Function GetNums 


SET NOCOUNT ON; 


IF DB_ID('TestTOPN') IS NULL CREATE DATABASE TestTOPN; 


GO 
USE TestTOPN; 


IF OBJECT_ID('dbo.GetNums', 'IF') IS NOT NULL DROP FUNCTION dbo.GetNums; 


GO 


CREATE FUNCTION dbo.GetNums(@n AS BIGINT) RETURNS TABLE 


LØ AS(SELECT 1 AS 
L1  AS(SELECT 1 AS 
L2 AS(SELECT 1 AS 
L3 AS(SELECT 1 AS 
L4 AS(SELECT 1 AS 
L5  AS(SELECT 1 AS c 


UNION ALL SELECT 1), 


AOA AA 


FROM LØ AS A CROSS JOIN LØ AS B), 
FROM L1 AS A CROSS JOIN L1 AS B), 
FROM L2 AS A CROSS JOIN L2 AS B), 
FROM L3 AS A CROSS JOIN L3 AS B), 
FROM L4 AS A CROSS JOIN L4 AS B), 


Nums ASC(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5) 


SELECT TOP(@n) n FROM Nums ORDER BY n; 
GO 
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LISTING 2: Code to Create and Populate Tables with 


Sample Data 


DECLARE 
@num_orders AS INT 5 
@num_customers AS INT s 
@num_employees AS INT 4 
@num_shippers AS INT 5 
@start_orderdate AS DATETIME, 
@end_orderdate AS DATETIME; 


SELECT 
@num_orders = 1000000 , 
@num_customers = 50000 , 
@num_employees = 490, 
@num_shippers = 16, 
@start_orderdate = '29%69191' 
@end_orderdate = '29199531'; 


IF OBJECT_IDC'dbo.Orders' , 'U') IS NOT NULL DROP TABLE dbo.Orders; 

IF OBJECT_IDC'dbo.Customers', 'U') IS NOT NULL DROP TABLE dbo.Customers; 
IF OBJECT_ID('dbo.Employees', 'U') IS NOT NULL DROP TABLE dbo.Employees; 
IF OBJECT_ID('dbo.Shippers' , 'U') IS NOT NULL DROP TABLE dbo.Shippers; 


-- Customers 
CREATE TABLE dbo.Customers 


custid INT NOT NULL, 
custname VARCHAR(5@) NOT NULL, 
filler CHAR(2@@) NOT NULL DEFAULT('a'), 
CONSTRAINT PK_Customers PRIMARY KEY(custid) 


INSERT INTO dbo.Customers WITH (TABLOCK) (custid, custname) 


SELECT n, 'Cust ' + CAST(n AS VARCHAR(18)) FROM dbo.GetNums(@num_customers) ; 


-- Employees 
CREATE TABLE dbo.Employees 
K 


empid INT NOT NULL, 

empname VARCHAR(5Ø) NOT NULL, 

filler CHAR(299) NOT NULL DEFAULT('a'), 

CONSTRAINT PK_Employees PRIMARY KEY(empid) 
); 


INSERT INTO dbo.Employees WITH (TABLOCK) (empid, empname) 


SELECT n, 'Emp ' + CAST(n AS VARCHAR(1Ø)) FROM dbo.GetNums(@num_employees) ; 


-- Shippers 

CREATE TABLE dbo.Shippers 

G 

shipperid INT NOT NULL, 

shippername VARCHAR(5@) NOT NULL, 

filler CHAR (200) NOT NULL DEFAULT('a'), 
CONSTRAINT PK_Shippers PRIMARY KEY(shipperid) 
5 


INSERT INTO dbo.Shippers WITH (TABLOCK) (shipperid, shippername) 
SELECT n, ‘Shipper ' + CAST(n AS VARCHAR(1@)) FROM dbo.GetNums (@num_ 
shippers); 


-- Orders 
CREATE TABLE dbo.Orders 
¢ 


orderid INT NOT NULL, 

custid INT NOT NULL, 

empid INT NOT NULL, 

shipperid INT NOT NULL, 

orderdate DATETIME NOT NULL, 

shipdate DATETIME NULL, 

filler CHAR(2@8) NOT NULL DEFAULT('a'), 


CONSTRAINT PK_Orders PRIMARY KEY(orderid), 
5 


WITH C AS 
¢ 


SELECT 
n AS orderid, 
ABS(CHECKSUM(NEWID())) % @num_customers + 1 AS custid, 
ABS (CHECKSUM(NEWID())) % @num_employees + 1 AS empid, 
ABS(CHECKSUM(NEWID())) % @num_shippers + 1 AS shipperid, 
DATEADD (day , 
ABS (CHECKSUM(NEWID() ) ) 
% (DATEDIFF(day, @start_orderdate, @end_orderdate) + 1), 
@start_orderdate) AS orderdate, 
ABS (CHECKSUM(NEWID())) % 31 AS shipdays 
FROM dbo.GetNums (@num_orders) 
) 


INSERT INTO dbo.Orders WITH C(TABLOCK) (orderid, custid, empid, shipperid, 


orderdate, shipdate) 
SELECT orderid, custid, empid, shipperid, orderdate, 
CASE 
WHEN DATEADD(day, shipdays, orderdate) > @end_orderdate THEN NULL 
ELSE DATEADD(day, shipdays, orderdate) 
END AS shipdate 
FROM C; 
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DESC), and include in the index definition all the other 
attributes you need to return from the query (eg., 
filler). As I mentioned, I’ll use custid as the partitioning 
column to demonstrate queries with low density and 
shipperid for high density. To demonstrate solutions 
when an index is available, you must run the following 
commands to create the indexes before testing: 


CREATE UNIQUE NONCLUSTERED INDEX idx_unc_sid_ 
odD_oidD_Ifiller 
ON dbo.Orders(shipperid, orderdate DESC, 
orderid DESC) 
INCLUDE (fi1 ler) ; 


CREATE UNIQUE NONCLUSTERED INDEX idx_unc_cid_ 
odD_oidD_Ifiller 
ON dbo.Orders(custid, orderdate DESC, orderid 
DESC) 
INCLUDE (filler); 


To demonstrate solutions when the optimal index isn’t 
available, use the following code to drop the indexes: 


DROP INDEX 
dbo.Orders.idx_unc_sid_odD_oidD_Ifiller, 
dbo.Orders.idx_unc_cid_odD_oidD_Ifiller; 


My test machine has an Intel Core 17 processor 
(quad core with hyperthreading, hence eight logical 
processors), 4GB of RAM, and a 7,200RPM hard 
drive. I tested three solutions, based on the ROW_ 
NUMBER function, the APPLY operator, and 
grouping and aggregation of concatenated elements. 


Solution Based on 
ROW_NUMBER 

Listing 3 contains the solution based on the ROW_ 
NUMBER function. This solution assumes that the 
request was for the most recent order for each cus- 
tomer. The concept is straightforward—assign row 
numbers partitioned by custid, ordered by orderdate 
DESC, orderid DESC, and filter only the rows in 
which the row number is equal to 1. 

Some benefits to this solution aren’t related to per- 
formance. For example, the solution is simple and can 
easily be tailored to work with N > 1 (simply change 
the filter to rownum <= N—e.g., rownum <= 3). 

As for performance, Figure 1 shows the plan for 
this query when the optimal index is available—in 


q paralleli Te i 
arallelism P 
SELECT Filter ; 
Cost: 0 % (Gather Streams) Cost: 0% (Dist 
Cost: 0 % 
Figure 2 


Execution plan for solution based on ROVWW_NUMBER 
without index 
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n May 2010, Microsoft is releasing SQL Server 

2008 R2. More than just an update to SQL 

Server 2008, R2 is a brand new version. This 

essential guide will cover the major changes in 
SQL Server 2008 R2 that DBAs need to know about. 
For additional information about any SQL Server 
2008 R2 features not covered, see http://www 
.microsoft.com/sqlserver. 


Editions and Licensing 

SQL Server 2008 R2 introduces two new editions: 
Datacenter and Parallel Data Warehouse. | will 
discuss Parallel Data Warehouse later, in the 
section “BI enhancements in SQL Server 2008 R2.” 

SQL Server 2008 R2 Datacenter builds on the 
value delivered in Enterprise and is designed 
for those who need the ultimate flexibility for 
deployments, as well as advanced scalability for 
things such as virtualization and large applications. 
For more information about scalability, see the 
section “Ready for Mission-Critical Workloads.’ 

SQL Server 2008 R2 also will include an increase 
in price, which might affect your purchasing and 
licensing decisions. Microsoft is increasing the 
per-processor pricing for SQL Server 2008 R2 by 
15 percent for the Enterprise Edition and 25 percent 
for the Standard Edition. There will be no change to 
the price of the server and CAL licensing model. The 
new retail pricing is shown in Table 1. 

Licensing is always a key consideration when 
you are planning to deploy a new version of SQL 
Server. Two things have not changed when it 
comes to licensing SQL Server 2008 R2: 


e The per server/client access model pricing 
for SQL Server 2008 R2 Standard and 
Enterprise has not changed. If you use 
that method for licensing SQL Server 
deployments, the change in pricing for the 
per-processor licensing will not affect you. 
That’s a key point to take away. 


e SQL Server licensing is based on sockets, 
not cores. A socket is a physical processor. 
For example, if a server has four physical 
processors, each with eight cores, and 
you choose Enterprise, the cost would be 
$114,996. If Microsoft charged per core, the 


Table 1: SQL Server 2008 R2 retail pricing 


Per Processor 


Per Server + Client 


(price in US Access Licenses (price 
dollars) in US dollars) 
Standard $7,499 $1,849 with 5 CALs 
Enterprise $28,749 $13,969 with 25 CALs 
Datacenter $57,498 Not available 
Parallel Data $57,498 Not available 
Warehouse 


cost would be an astronomical $919,968. 
Some of Microsoft's competitors do charge 
per core, so keep that in mind as you decide 
on your database platform and consider 
what licensing will cost over its lifetime. 


Buy Software Assurance (SA) now to lock in 
your existing pricing structure before SQL Server 
2008 R2 is released, and you'll avoid paying the 
increased prices for SQL Server 2008 R2 Standard 
and Enterprise licenses. If you're planning to 
upgrade during the period for which you have 
purchased SA, you'll save somewhere between 
15 percent and 25 percent. If you purchase SA, 
you also will be able to continue to use unlimited 
virtualization when you decide to upgrade to 
SQL Server 2008 R2 Enterprise Edition. 

One change that you should note: In the past, 
the Developer edition of SQL Server has been a 
developer-specific version that contained the same 
specifications and features as Enterprise. For SQL 
Server 2008 R2, Developer now matches Datacenter. 


Upgrading to SQL Server 2008 R2 
SQL Server 2000, SQL Server 2005, or SQL Server 
2008 can all be upgraded to SQL Server 2008 
R2. The upgrade process is similar to the one 
documented for SQL Server 2008 for standalone 
or clustered instances. A great existing resource 
you can use is the SQL Server 2008 Upgrade 
Technical Reference Guide, available for download 
at Microsoft.com. 

You can augment this reference with the 
SQL Server 2008 R2-specific information in SQL 
Server 2008 R2 Books Online. Useful topics to 
read include “Version and Edition Upgrades” 
(documents what older versions and editions 
can be upgraded to which edition of SQL Server 
2008 R2), “Considerations for Upgrading the 
Database Engine,’ and “Considerations for Side- 
by-Side Instances of SQL Server 2008 R2 and SQL 
Server 2008.’ As with SQL Server 2008, it's highly 
recommended that you run the Upgrade Advisor 
before you upgrade to SQL Server 2008 R2. The tool 
will check the viability of your existing installation 
and report any known issues it discovers. 


BI Enhancements in SQL Server 2008 R2 
This guide focuses on the relational side of 

SQL Server 2008 R2, but Microsoft has also 
enhanced the features used for BI. Some of those 
improvements include: 


e SQL Server 2008 R2 Parallel Data Warehouse, 
shipping later in 2010, makes data 
warehousing more cost effective. It can 
manage hundreds of terabytes of data and 
deliver stellar performance with parallel 
data movement and a scale out architecture. 


With the capability to leverage commodity 
hardware and the existing tools, Parallel Data 
Warehouse (formerly known as “Madison”) 
becomes the center of a BI deployment and 
allows many different sources to connect to it, 
much like a hub-and-spoke system. Figure 1 
shows an example architecture. 


e SQL Server 2008 R2 Enterprise and Datacenter 
ship with Master Data Services, which enables 
master data management (MDM) for an 
organization to create a single source of “the 
truth,’ while securing and enabling easy access 
to the data. To briefly review MDM and its 
benefits, consider that you have lots of internal 
data sources, but ultimately you need one 
source that is the master copy of a given set 
of data. Data is also interrelated (for example, 
a Web site sells a product from a catalog, but 
that is ultimately translated into a transaction 
where an item is purchased, packed, and sent; 
that requires coordination). Such things as 
reporting might be tied in as well—essentially 
encompassing both the relational and analytic 
spaces. These concepts and the tools to enable 
them are known as MDM. 


e PowerPivot for SharePoint 2010 is a combination 
of client and server components that lets end 


users use a familiar tool—Excel—to securely 
and quickly access large internal and external 
multidimensional data sets. PowerPivot brings 
the power of BI to users and lets them share the 
information through SharePoint, with automatic 
refresh built in. No longer does knowledge need 
to be siloed. For more information, go to http:// 
www.powerpivot.com/. 


e Reporting Services has been greatly 
enhanced with SQL Server 2008 R2, including 
improvements to reports via Report Builder 
3.0 (e.g., ways to report using geospatial data 
and to calculate aggregates of aggregates, 
indicators, and more). 


Consolidation and Virtualization 
Consolidation of existing SQL Server instances 

and databases can take on various flavors and 
combinations, but the two main categories are 
physical and virtual. More and more organizations 
are choosing to virtualize, and choosing the right 
edition of SQL Server 2008 R2 can lower your costs 
significantly. A SQL Server 2008 Standard license 
comes with one virtual machine license; Enterprise 
allows up to four. By choosing Datacenter and 
licensing an entire server, you get the right to deploy 
as many virtual machines as you want on that server, 


Figure 1: Example of a Parallel Data Warehouse implementation showing the hub and its spokes 
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whether or not the virtualization platform is from 
Microsoft. And as mentioned earlier, if you purchase 
SA for an existing SQL Server 2008 Enterprise license, 
you can enjoy unlimited virtualization on a server 
with SQL Server 2008 R2. However, this right does 
not extend to any major SQL Server Enterprise 
release beyond SQL Server 2008 R2. So consider 
your licensing needs for virtualization and plan 
accordingly for future deployments that are not 
upgrades covered by SA. 

Live Migration is one of the most exciting new 
features of Windows Server 2008 R2. It’s built 
upon the Windows failover clustering feature and 
provides the capability to move a Hyper-V virtual 
machine from one node of the cluster to another 
with zero downtime. SQL Server's performance 
will degrade briefly during the migration, but 
applications and end users will never lose their 
connection, nor will SQL Server stop processing 
transactions. That capability is a huge leap forward 
for virtualization on Windows. Live Migration 
is for planned failovers, such as when you do 
maintenance on the underlying cluster node; but 
it allows 100 percent uptime in the right scenarios. 
Although its performance will temporarily be 
affected in the switch of the virtual machine from 
one server to another, SQL Server continues to 
process transactions. Everyone may not need this 
functionality, but Live Migration is a compelling 
reason to consider Hyper-V over other platforms 
since it ships as part of Windows Server 2008 R2. 

For deployments of SQL Server, SQL Server 2008 
R2 supports the capability to SysPrep installations 
for an installation on a standalone server or virtual 
machine. This means that you can easily and 
consistently standardize and deploy SQL Server 
configurations. And ensuring that each SQL Server 
instance looks, acts, and feels the same to you as the 
DBA improves your management experience. 

Note that if you're using a non-Microsoft platform 
for virtualizing SQL Server, it must be listed as part of 
the Server Virtualization Validation Program (http:// 
www.windowsservercatalog.com/svvp.aspx). Check 
this list to ensure that the version of the vendor's 
hypervisor is supported. Also make sure to check 
Knowledge Base article 956893 (http://support 
-microsoft.com/kb/956893), which outlines how SQL 
Server is supported in a virtualized environment. 


Management 

SQL Server 2008 introduced two key management 
features: Policy-Based Management and Resource 
Governor. Policy-Based Management lets you 
define a set of rules (a policy) based on a specific 
set of conditions (the condition). A condition is 
made up of facets (e.g., Database); it then has a 
bunch of properties (such as AutoShrink) that can 
be evaluated. Consider this example: As DBA, you 


do not want AutoShrink enabled on any database. 
AutoShrink can be queried for a database, and if 

it is enabled, it can either just be reported back as 
such and leave you to perform an action manually 

if desired, or if a certain condition is met (such as 
AutoShrink being enabled) and that is not your 
desired condition, it can be disabled automatically 
once it has been detected. The choice is up to you 
as the implementer of the policy. You can then roll 
out the policy and enforce it for all of the SQL Server 
instances in a given environment. With PowerShell, 
you can even use Policy-Based Management to 
manage SQL Server 2000 and SQL Server 2005. It’s a 
great feature that you can use to enforce compliance 
and standards in a straightforward way. 

Resource Governor is another feature that is 
important in a post-consolidated environment. It 
allows you as DBA to ensure that a workload will 
not bring an instance to its knees, by defining CPU 
and memory parameters to keep it in check— 
effectively stopping things such as runaway queries 
and unpredictable executions. It can also allow a 
workload to get priority over others. For example, 
let's say an accounting application shares an 
instance with 24 other applications. You observe 
that, at the end of the month, performance for the 
other applications suffers because of the nature 
of what the accounting application is doing. You 
can use Resource Governor to ensure that the 
accounting application gets priority, but that it 
doesn’t completely starve the other 24 applications. 
Be aware that Resource Governor cannot throttle 
I/O in its current implementation and should only 
be configured if there is a need to use it. If it’s 
configured where there is no problem, it could 
potentially cause one; so use Resource Governor 
only if needed. 

SQL Server 2008 R2 also introduces the SQL Server 
Utility. The SQL Server Utility gives you as DBA 
the capability to have a central point to see what 
is going on in your instances, and then use that 
information for proper planning. For example, in the 
post-consolidated world, if an instance is either over- 
or under-utilized, you can handle it in the proper 
manner instead of having the problems that existed 
pre-consolidation. Before SQL Server 2008 R2, the 
only way to see this kind of data in a single view was 
to code a custom solution or buy a third-party utility. 
Trending and tracking databases and instances 
was potentially very labor and time intensive. To 
address this issue, the SQL Server Utility is based 
on setting up a central Utility Control Point that 
collects and displays data via dashboards. The Utility 
Control Point builds on Policy-Based Management 
(mentioned earlier) and the Management Data 
Warehouse feature that shipped with SQL Server 
2008. Setting up a Utility Control Point and enrolling 
instances is a process that takes minutes—not hours, 


days, or weeks, and large-scale deployments can utilize or package, which contains all the database objects 
PowerShell to enroll many instances. SQL For example, needed for an application that you can create from 


Server 2008 R2 Enterprise allows you to manage up existing applications or in Visual Studio if you are a 

to 25 instances as part of the SQL Server Utility, and developer. You will then use that package to deploy 

Datacenter has no restrictions. the database portion of the application via a standard 
Besides the SQL Server Utility, SQL Server 2008 process instead of having a different deployment 

R2 introduces the concept of a data-tier application method for each application. One of the biggest 


(DAC). A DAC represents a single unit of deployment, benefits of the DAC is that you can modify it at any 


Figure 2: Sample SQL Server Utility architecture 
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Table 2: The major differences between Standard, Enterprise, 


and Datacenter editions 


ratio. You can find the full results and 
information about TPC-E at http:// 


ee Sars in| once 


Max Memory 64GB 


by Windows version 


Max CPU (licensed 
per socket, not core) 


Asockets 8 sockets 


time in Visual Studio 2010 (for example, adding a 
column to a table). In addition, you can redeploy 
the package using the same standard process 
as the initial deployment in SQL Server 2008 R2, 
with minimal to no intervention by IT or the DBAs. 
This capability makes upgrades virtually painless 
instead of a process fraught with worry. Certain 
tasks, such as backup and restore or moving data, 
are not done via the DAC; they are still done at 
the database. You can view and manage the DAC 
via the SQL Server Utility. Figure 2 shows what a 
sample SQL Server Utility architecture with a DAC 
looks like. 

IMPORTANT: Note that DAC can also refer to 
the dedicated administrator connection, a feature 
introduced in SQL Server 2005. 


Ready for Mission-Critical Workloads 
Mission-critical workloads (including things such as 
consolidation) require a platform that has the right 
horsepower. The combination of Windows Server 
2008 R2 and SQL Server 2008 R2 provides the best 
performance-to-cost ratio on commodity hardware. 
Table 2 highlights the major differences between 
Standard, Enterprise, and Datacenter. 

With Windows Server 2008 R2 Datacenter, SQL 
Server 2008 R2 Datacenter can support up to 256 
logical processors. If you're using the Hyper-V 
feature of Windows Server 2008 R2, up to 64 logical 
processors are supported. That’s a lot of headroom 
to implement physical or virtualized SQL Server 
deployments. SQL Server 2008 R2 also has support 
for hot-add memory and processors. As long as your 
hardware supports those features, your SQL Server 
implementations can grow with your needs over 
time, instead of your having to overspend when you 
are initially sizing deployments. 

SQL Server 2008 R2 Datacenter achieved a new 
world record of 2,013 tpsE (tps = transactions per 
second) on a 16-processor, 96-core server. This 
record used the TPC-E benchmark (which is closer 
to what people do in the real world than TPC-C). 
From a cost perspective, that translates to $958.23 
per tpsE with the hardware configuration used for 
the test, and it shows the value that SQL Server 2008 
R2 brings to the table for the cost-to-performance 


Maximum memory supported 


Maximum memory supported 
by Windows version 


Note that Windows Server 2008 R2 
is 64-bit only, so now is a good time 
to start transitioning and planning 
deployments to 64-bit Windows and 
SQL Server. If for some reason you still 
require a 32-bit (x86) implementation 
of SQL Server 2008 R2, Microsoft is still shipping 
a 32-bit variant. If an implementation is going 
to use a 32-bit SQL Server 2008 R2 instance, it’s 
recommended that you deploy it with the last 32-bit 
server operating system that Microsoft will ever ship: 
the original (RTM) version of Windows Server 2008. 
Performance is not the only aspect that makes a 
platform mission critical; it must also be available. 
SQL Server 2008 R2 has the same availability features 
as its predecessor, and it can give your SQL Server 
deployments the required uptime. Add to that mix 
the Live Migration feature for virtual machines, 
mentioned earlier; and there are quite a few 
methods to make instances and databases available, 
depending on how SQL Server is deployed in your 
environment. 


Don’t Overlook SQL Server 2008 R2 

SQL Server 2008 R2 is not a minor point release of 
SQL Server and should not be overlooked; it is a new 
version full of enhancements and features. If you are 
an IT administrator or DBA, or you are doing BI, SQL 
Server 2008 R2 should be your first choice when it 
comes to choosing a version of SQL Server to use 
for future deployments. By building on the strong 
foundation of SQL Server 2008, SQL Server 2008 R2 
allows you as the DBA to derive even more insight 
into the health and status of your environments 
without relying on custom code or third-party 
utilities—and there is an edition that meets your 
performance and availability needs for mission- 
critical work, as well. 


Allan Hirt has been using SQL Server in various 
guises since 1992. For the past 10 years, he has been 
consulting, training, developing content, speaking 
at events, and authoring books, whitepapers, and 
articles. His most recent major publications include 
the book Pro SQL Server 2005 High Availability 
(Apress, 2007) and various articles for SQL Server 
Magazine. Before striking out on his own in 2007, 
he most recently worked for both Microsoft and 
Avanade, and still continues to work closely with 
Microsoft on various projects. He can be reached via 
his website at http://www.sqlha.com or at allan@ 
sqlha.com. 
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Execution plan for solution based on ROW_NUMBER with index 


both low-density and high-density cases—and 
Figure 2 shows the plan when the optimal index isn’t 
available. When an optimal index exists (i.e., supports 
the partitioning and ordering needs as well as covers 
the query), the bulk of the cost of the plan is associ- 
ated with a single ordered scan of the index. 

Note that when partitioning is involved in the 
ranking calculation, the direction of the index col- 
umns must match the direction of the columns in 
the calculations ORDER BY list for the optimizer 
to consider relying on index ordering. This means 
that if you create the index with the keylist (custid, 
orderdate, orderid) with both orderdate and orderid 
ascending, you'll end up with an expensive sort in 
the plan. Apparently, the optimizer wasn’t coded to 
realize that a backward scan of the index could be 
used in such a case. 

Returning to the plan in Figure 1, very little cost 
is associated with the actual calculation of the row 
numbers and the filtering. No expensive sorting is 
required. Such a plan is ideal for a low-density sce- 
nario (7.7 seconds on my system, mostly attributed 
to the scan of about 28,000 pages). This plan is also 
acceptable for a high-density scenario (6.1 seconds 
on my system); however, a far more efficient solution 
based on the APPLY operator is available. 

When an optimal index doesn’t exist, the ROW_ 
NUMBER solution is very inefficient. The table is 
scanned only once, but there’s a high cost associated 
with sorting, as you can see in Figure 2. 


Solution Based on APPLY 

Listing 4 shows an example for the solution based on 
the APPLY operator, with custid as the partitioning 
column and 1 as the number of most recent orders to 
return per partition. This solution queries the table 
representing the partitioning entity (custid) and uses 
the APPLY operator to apply a query that returns the 
TOP(N) most recent orders for each outer row. This 
solution is straightforward and can easily be tailored 
to support multiple rows per partition (simply adjust 
the input to the TOP option). 


LISTING 3: Solution Based on ROW_NUMBER 


WITH C AS 
¢ 


SELECT custid, orderdate, orderid, filler, 
ROW_NUMBER() OVER(PARTITION BY custid 
ORDER BY orderdate DESC, orderid DESC) AS rownum 
FROM dbo.Orders 


SELECT custid, orderdate, orderid, filler 
FROM C 
WHERE rownum = 1; 


LISTING 4: Solution Based on APPLY 


SELEC Acn 
FROM dbo.Customers AS C 
CROSS APPLY (SELECT TOP (1) custid, orderdate, orderid, filler 
FROM dbo.Orders AS 0 
WHERE O.custid = C.custid 
ORDER BY orderdate DESC, orderid DESC) AS A; 


LISTING 5: Solution Based on Concatenation 


WITH C AS 


SELECT 
custid, 
MAX( (CONVERT (CHAR(8), orderdate, 112) 
+ RIGHT('ØØØØØØØØØ' + CASTCorderid AS VARCHAR(19)), 19) 
+ filler) COLLATE Latinl_General_BIN2) AS string 
FROM dbo.Orders 
GROUP BY custid 


J 
SELECT custid, 


CAST(SUBSTRING(string, 1, 8) AS DATETIME ) AS orderdate, 

CAST(SUBSTRING(string, 9, 10) AS INT ) AS orderid, 

CAST(SUBSTRING(string, 19, 2@@) AS CHAR(2@@)) AS filler 
FROM C; 


As for performance, Figure 3 shows the plan with 
the optimal index, for both low-density and high- 
density cases. Figure 4 shows the plan when the index 
isn’t available; note the different plans for low-density 
(upper plan) and high-density (lower plan) cases. 

You can see in the plan in Figure 3 that when 
an index is available, the plan scans the table rep- 
resenting the partitioning entity, then for each 
row performs a seek and partial scan in the index 
on the Orders table. This plan is highly efficient 
in high-density cases because the number of 
seek operations is equal to the number of parti- 
tions; with a small number of partitions, there’s a 
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small number of seek operations. When I ran this solution will be cheaper. As an example, when using 
solution for the high-density case (with shipperid custid as the partitioning column, the solution in 
instead of custid), when the index was available, I got Listing 4 involves more than 350,000 reads. 

only 39 logical reads—compared with about 28,000 As for the efficiency of this solution when an 
reads in the previous solution, based on ROW_ optimal index doesn’t exist, observe the plans in 
NUMBER. In the low-density case there willbe many Figure 4. In the low-density case (upper plan), for 
seek operations, and the I/O cost will become accord- each partition, the optimizer scans the clustered index 
ingly high, to a point where the ROW_NUMBER of the Orders table, creates an index (index spool), 
and applies a Top N Sort. In the high-density case 


| i ky (lower plan), the optimizer doesn’t create an index 
Nested Loops Clustered Index Scan (Clu... 
ar eee isopeed IE pee” but follows the other steps. In both cases the plans 
Cost: 1 % Cost: 40 % involve an extremely high number of reads (more 
= $ than 3,000,000 in the low-density case and more than 
Top =~ Index Seek (NonClusterq 300,000 in the high-density case), as well as the sorting 
cot: O% LOrders ]. Lidx_unc_sid_o| =. 
Cost: 59 % costs (although it’s not a full sort but rather a Top N 
Figure 3 Sort with the purpose of filtering). 
Execution plan for solution based on APPLY with index . 
P Solution Based on 
Query 1: Query cost (relative to the batch): 80% Concatenation 
SELECT A.* FROM dbo.Customers AS C CROSS APPLY (SELECT TOP (1) custid, orderdate, or š 
z Some solutions work well only when the 
E| = ~ Nested Loops "Clustered Index Scan (Clu... right indexes are available, but without 
(Inner Join) | [Customers]. PK_Customers... . 2 
Cost: 0 % | Cost: 0 % those indexes the solutions perform badly. 
En a iy Such is the case with both the solutions I 
Sort ——— Index Spool Clustered Index Scan (q : : : 
Gos NSD (Eager nool) lorders]. [pk orders] | Presented. Creating new indexes isn’t always 
Cost: 64 % Cost: 34 % Cost: 3 % . A š 
an option: Perhaps your organization has 
Query 2: Query cost (relative to the batch): 20% policies against doing so; or perhaps write 
SELECT A.* FROM dbo.Shippers AS S CROSS APPLY (SELECT TOP (1) shipperid, orderdate, 
al performance is paramount on your sys- 
A, Nested Loops — Clustered Index Scan (Clu... tems, and adding indexes adds overhead to 
Cost: 0 % (inner Join) LShippers].LPK_Shippers J i , i i 
i Cost: 0 % Cost: 0 % the writes. If creating an optimal index to 
a ky support the ROW_NUMBER or APPLY 
ee ki ee ee ee solutions isn’t an option, you can use a solu- 
Cost: 36 % Cost: 64 % 


tion based on grouping and aggregation of 


Figure 4 concatenated elements. Listing 5 presents 


Execution plan for solution based on APPLY without index such a solution. 
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WITH C AS ( SELECT custid, MAX((CONVERT(CHAR(8), orderdate, 112) + RIGHT('000000000' + cAST(orderid AS VARCHAR(10)), 10) 
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Execution plan for solution based on concatenation with index 
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Execution plan for solution based on concatenation without index 
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Duration (seconds) CPU (seconds) 
25 10 
9 
20 3 
7 
15 S 
5 
10 4 
3 
ii a. 
: With Index Without Index With Index Without Index : With Index Without Index E ader. Without Index 
Low Density High Density Low Density High Density 
m AggConcat 8.931 5.584 6.028 4.566 E AggConcat 1.95 2.948 1.402 1.84 
E Apply 12.093 10.916 0.33 6.615 E Apply 0.702 8.752 o 1.373 
E RowNum 7.708 19.474 6.111 16.804 E RowNum 0.546 4.34 0.39 5.567 
Figure 7 Figure 8 
Duration statistics CPU statistics 
The main downside of this solution is that it’s com- 
plex and unintuitive—unlike the others. The common 
table expression (CTE) query groups the data by the 
partitioning column. The query applies the MAX 
aggregate to the concatenated partitioning + ordering 
+ covered elements after converting them to a common 
form that preserves ordering and comparison behavior 
of the partitioning + ordering elements (fixed sized char- 
acter strings with binary collation). Of course, you need ace. ai mm Gee aie 
to make sure that ordering and comparison behavior is sisting PEEN 
preserved, hence the use of style 112 (YYYYMMDD) MAggconcat|_ 27872 ae a EFA 
for the order dates and the addition of leading zeros to m Apply 353996 3522262 39 304182 
the order IDs. The outer query then breaks the concat- ai kisia 27872 = sche pues 
enated strings into the individual elements and converts , 
them to their original types. Another downside of this Figure ? 
solution is that it works only with N = 1. Logical reads statistics 
As for performance, observe the plans for this 
solution, shown in Figure 5 (with index in place) and Benchmark Results 
Figure 6 (no index). With an index in place, the opti- You can find benchmark results for my three solutions 
mizer relies on index ordering and applies a stream in Figure 7 (duration), Figure 8 (CPU), and Figure 9 
aggregate. In the low-density case (upper plan), the (logical reads). Examine the duration and CPU 
optimizer uses a serial plan, whereas in the high-density statistics in Figure 7 and Figure 8. Note that when 
case (lower plan), it utilizes parallelism, splitting the an index exists, the ROW_NUMBER solution is the 
work to local (per thread) then global aggregates. These most efficient in the low-density case and the APPLY 
plans are very efficient, with performance similar to solution is the most efficient in the high-density case. 
the solution based on ROW_NUMBER; however, the Without an index, the aggregate concatenation solu- 
complexity of this solution and the fact that it works tion works best in all densities. In Figure 9 you can 
only with N = 1 makes it less preferable. see how inefficient the APPLY solution is in terms of 
Observe the plans in Figure 6 (no index in place). The T/O, in all cases but one—high density with an index 
plans scan the table once and use a hash aggregate forthe available. (In the low-density without index scenario, 
majority of the aggregate work (only aggregate operator the bar for APPLY actually extends about seven times 
in the low-density case, and local aggregate operator in beyond the ceiling of the figure, but I wanted to use a 
the high-density case). A hash aggregate doesn’t involve scale that reasonably compared the other bars.) 
sorting and is more efficient than a stream aggregate Comparing these solutions demonstrates the 
when dealing with a large number of rows. In the high- importance of writing multiple solutions for a given 
density case a sort and stream aggregate are used only task and carefully analyzing their characteristics. You 
to process the global aggregate, which involves a small must understand the strengths and weaknesses of 
number of rows to process and is therefore inexpensive. each solution, as well as the circumstances in which 
Note that in both plans in Figure 6 the bulk of the cost each solution excels. SQL 
is associated with the single scan of the data. InstantDoc ID 103663 
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Management 
in SQL Server 2008, Part 2 


Use column sets with sparse columns, and get 
better query performance with filtered indexes 


QL Server 2008 introduced new features to 

help you work more efficiently with your data. 

In “Efficient Data Management in SQL Server 
2008, Part 1” (April 2010, InstantDoc ID 103586), I 
explained the new sparse columns feature, which lets 
you save significant storage space when a column has 
a large proportion of null values. In this article, P1 
look closer at column sets, which provide a way to 
easily work with all the sparse columns in a table. PI 
also show you how you can use filtered indexes with 
both sparse and nonsparse columns to create small, 
efficient indexes on a subset of the rows in a table. 


Working with Column Sets 

A column set can add functionality to a table with sparse 

columns. As explained in part 1, a column set is an untyped 

XML column that you can use to get and set all sparse 

column values as a set. Remember that you can’t add a 

column set to an existing table that already has sparse 

columns. If you attempt to execute an ALTER TABLE 

statement on such a table, you'll get one of Microsoft’s 

new, informative error messages explaining the problem. 

You have three options to solve this problem: 

° delete and recreate the table, creating the column 
set along with the sparse columns 

° delete all the sparse columns and create an 
ALTER TABLE statement that adds them back 
along with the column set 

è create an entirely new table 


Tl choose the third option and use the code that 
Listing 1 shows to create a table called DogCS, which 
includes the Characteristics column set. The code also 
inserts data for several dogs. 

Because it has a column set, DogCS will behave 
differently from the Dog table in part 1. When you 
execute the following SELECT statement, you get the 
results that Figure 1 shows: 


SELECT * FROM DogCS 


Notice that none of the sparse columns are included 
in the result set. Instead, the contents of the column 
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set are shown as a snippet of XML data. The results 
in Figure | might be a little hard to read, so here’s the 
complete value of the Characteristics column set field 
for Mardy, the first row of the results: 


<BirthDate>1997-06-30</BirthDate> 
<Wei ght>62</Weight><OnGoingDrugs> 
Metacam, Phenylpropanolamine, 
Rehab Forte</OnGoingDrugs> 


Adding a column set to a table schema could 
break any code that uses a SELECT * and expects 
to get back individual field data rather than a 
single XML column. But no one uses SELECT * in 
production code, do they? 

You can retrieve the values of sparse columns indi- 
vidually by including the field name in the SELECT 


N 


Don Kiely 


(donkiely @ computer.org), MVP, MCSD, is 
a senior technology consultant, building 
custom applications and providing 
business and technology consulting 
services. His development work involves 
SQL Server, Visual Basic, C#, ASP.NET, and 
Microsoft Office. 


LISTING l: Creating the DogCS Table and 


Inserting Data 


CREATE TABLE DogCS 
¢ 


DogID INT NOT NULL PRIMARY KEY IDENTITY(1,1), 
Name NVARCHAR(2@) NOT NULL, 

LastRabies DATE NULL, 

Sleddog BIT NOT NULL, 

Handedness NVARCHAR(5) SPARSE NULL, 
BirthDate DATE SPARSE NULL, 

BirthYear INT SPARSE NULL, 

DeathDate DATE SPARSE NULL, 

[Weight] INT SPARSE NULL, 

Leader BIT SPARSE NULL, 

Rescue BIT NULL, 

OnGoingDrugs NVARCHAR(5@) SPARSE NULL, 
SpayNeuter BIT NULL, 


Characteristics XML COLUMN_SET FOR ALL_SPARSE_COLUMNS 


J 

INSERT INTO DogCS ([Name], 
[weight], [Rescue], [OnGoingDrugs], [SpayNeuter]) 
VALUES ('Mardy', '11/3/2005', Ø, 
'Metacam, Phenylpropanolamine, Rehab Forte', 1); 
INSERT INTO DogCS ([Name], 
[Leader], [Rescue], [SpayNeuter]) 
VALUES ('Izzi', '11/3/2985', 1, 
INSERT INTO DogCS ([Name], 
[OnGoingDrugs], [SpayNeuter]) 


2001, @, 1, 1); 


INSERT INTO DogCS ([Name], 
[Leader], [Rescue], [SpayNeuter]) 

VALUES ('Casper', '18/17/2007', 1, 2002, 1, 1, 1); 
INSERT INTO DogCS ([Name], 
[Weight], [Leader], [Rescue], [SpayNeuter]) 


INSERT INTO DogCS ([Name], 


"6/30/1997", 62), 1, 


VALUES ('Jewel', '9/23/2007', Ø, 1, 'Rehab Forte', 1); 
[LastRabies], [Sleddog], [BirthYear], 


VALUES ('Chance', '9/23/2007', 1, 2002, 36, 1, 1, 1); 
[LastRabies], [Sleddog], [BirthDate], 
[Weight], [Leader], [Rescue], [OnGoingDrugs], [SpayNeuter]) 


[LastRabies], [Sleddog], [BirthDate], 


[LastRabies], [Sleddog], [BirthYear], 


[LastRabies], [Sleddog], [Rescue], 


[LastRabies], [Sleddog], [BirthYear], 


VALUES ('Daikon', '10/17/2007', 1, '2/14/1997', 50, 1, Ø, 


"Potassium bromide, DMG', 1); 
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list. For example, the following statement produces 
the results that Figure 2 shows: 


SELECT [DogID], [Name], [LastRabies], 
[Sleddog] , [Handedness] , [BirthDate], 
[BirthYear], [DeathDate] , [Weight], 
[Leader] , [Rescue] , [OnGoingDrugs] , 
[SpayNeuter] FROM DogCS 


Similarly, you can explicitly retrieve the column set 
with the following statement: 


SELECT [DogID], [Name], [Characteristics] 
FROM DogCS 


You can also mix and match sparse columns and 
the column set in the same SELECT statement. Then 
you could access the data in different forms for use in 
client applications. The following statement retrieves 
the Handedness, BirthDate, and BirthYear sparse 
columns, as well as the column set, and produces the 
results that Figure 3 shows: 


SELECT [DogID], [Name], [Handedness], 
[BirthDate], [BirthYear], 
[Characteristics] FROM DogCS 


Column sets really become useful when you insert 
or update data. You can provide an XML snippet 
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LastRabies 

2005-11-03 
2005-11-03 
2007-09-23 
2007-10-17 
2007-09-23 
2007-10-17 


Casper 
Chance 


one w nN = 


Daikon 


wo 


a) 31S) Sf ais 


Characteristics 

<BirthD ate> 1997-06-30< /BirthD ate><Weight> B2</wei 
<BirthY ear 2001 < /BitthY ear <Leader 0</Leader> 
<OnGoingDrugs>Rehab Fortes /OnGoingDrugs> 
<BirthY ear> 2002</BirthY ear> <Leader 1</Leader> 
<BirthYear> 2002< /BirthY ear><‘weight> 36</Weight> <L. 
<BitthDate>1997-02-14</BirthD ate><Weight> 50</Wwei.. 


leddog Rescue SpayNeuter 
1 1 


ahata 


Figure | 


Results of a SELECT * statement on a table with a column set 


E Results Eà Messages | 


as the value of the column set, which populates the 
associated sparse columns. For example, the following 
code adds the information for Raja: 


INSERT INTO DogCS ([Name],[LastRabies], 
[SledDog] , [Rescue] , [SpayNeuter], 
Characteristics) 

VALUES ('Raja', '11/5/2008', 1, 1, 1, 
'<BirthDate>2004-98-30</BirthDate> 
<Wei ght>53</Weight>'); 


Instead of providing a value for each sparse column, 
the code provides a single value for the column 
set in order to populate the sparse columns, in 
this case Raja’s birthdate and weight. The fol- 
lowing SELECT statement produces the results that 
Figure 4 shows: 


SELECT [DogID], [Name], BirthDate, 
Handedness, [Characteristics] 
FROM DogCS WHERE Name = 'Raja'; 


Notice that the BirthDate field has the data from the 
column set value, but because no value was set for 
Handedness, that field is null. 

You can also update existing sparse column values 
using the column set. The following code updates 
Raja’s record and produces the results that Figure 5 
shows: 


UPDATE DogCS SET Characteristics = 
"<Weight>53</Weight><Leader>1</Leader>' 
WHERE Name = 'Raja'; 

SELECT [DogID], [Name] ,BirthDate, 

[weight] ,Leader, [Characteristics] 
FROM DogCS WHERE Name = 'Raja'; 


But what happened to Raja’s birthdate from the 
INSERT statement? The UPDATE statement set 
the field to null because the 


LastRabies 
2005-11-03 
2005-11-03 
2007-09-23 
2007-10-17 
2007-09-23 
2007-10-17 


DoglID Name 


Chance 
Daikon 


Figure 2 


leddog Handedness  BirthDate 
NULL 1997-06-30 
NULL NULL 
NULL NULL 
NULL NULL 2002 
NULL NULL 2002 
NULL 1997-02-14 NULL 


BitthYear 
NULL 
2001 
NULL 


DeathDate Weight Leader Rescue 
NULL 62 NULL 1 
NULL NULL 0 1 
NULL NULL NULL 1 
NULL 1 
NULL 1 
NULL 0 


Results of a SELECT statement that includes sparse columns in a select list 
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DoglID Name Handedness BirthDate Bithear Characteristics 

1 Mardy NULL 1997-06-30 NULL <BirthD ate> 1997-06-30</BirthD ate» <W eight> B2< /Wei. 

2 2 Izzi NULL NULL 2001 <BirthYear> 2001 </BirthY ear <Leader> 0</Leader> 

3 3 Jewel NULL NULL NULL <OnGoingDrugs> Rehab Forte</OnGoingDrugs> 

4 4 Casper NULL NULL 2002 <BirthYear> 2002< /BirthY ear <Leader>1</Leader> 

5 5 Chance NULL NULL 2002 <BirthYear> 2002</BirthY ear> <w'eight> 36</Weight> <L. 

6 6 Daikon NULL 1997-02-14 NULL <BirthDate> 1997-02-14</BirthD ate> <Weight>50</wei... 
Figure 3 


Results of including sparse columns and the column set in the same query 
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OnGoingDrugs 

Metacam, Phenylpropa.. 1 
NULL 1 
Rehab Forte 1 
NULL 1 
NULL g 
Potassium bromide, DMG 1 


code didn’t include it in the 
value of the column set. 
So you can update values 
using a column set, but it’s 
important to include all 
non-null values in the value 
of the column set field. SQL 
Server sets to null any sparse 
columns missing from the column set value. 

One limitation of updating data using column 
sets is that you can’t update the value of a column 
set and any of the individual sparse column values 
in the same statement. If you try to, yov’ll receive 
an error message: The target column list of an 
INSERT, UPDATE, or MERGE statement cannot 
contain both a sparse column and the column set that 


SpayNeuter 


contains the sparse column. Rewrite the statement 
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to include either the sparse column or the column 
set, but not both. In other words, write different 
statements to update the different fields or use a 
single statement and include all sparse fields in the 
column set. 

If you update the column set field and include 
empty tags in either an UPDATE or INSERT 
statement, you'll get the default values for each of 
those sparse columns’ data types as the non-null value 
in each column. You can mix and match empty tags 
and tags with values in the same UPDATE 
or INSERT statement; the fields with 
empty tags will get the default values, and 
the others will get the specified values. For 
example, the code in Listing 2 uses empty 
tags for several of the sparse columns in 
the table, then selects the data. As Figure 6 
shows, string fields have an empty string, 
numeric fields have zero, and date fields 
have the date 1/1/1900. None of the fields 
have null values. 

The column set field is the XML data type, so 
you can use all the methods of the XML data type 
to query and manipulate the data. For example, the 
following code uses the query() method of the XML 
data type to extract the weight node from the XML 
data: 


SELECT Characteristics.query('/Weight') 
FROM DogCS WHERE Name = 'Raja' 


Column Set Security 

The security model for a column set and its under- 
lying sparse columns is similar to that for tables 
and columns, except that for a column set it’s a 
grouping relationship rather than a container. You 
can set access security on a column set field like you 
do for any other field in a table. When accessing 
data through a column set, SQL Server checks the 
security on the column set field as you'd expect, 
then honors any DENY on each underlying sparse 
column. 

But here’s where it gets interesting. A GRANT 
or REVOKE of the SELECT, INSERT, UPDATE, 
DELETE, or REFERENCES permissions on a 
column set column doesn’t propagate to sparse 
columns. But a DENY does propagate. SELECT, 
INSERT, UPDATE, and DELETE statements 
require permissions on a column set column as well 
as the corresponding permission on all sparse col- 
umns, even if you aren’t changing a particular sparse 
column! Finally, executing a REVOKE on sparse 
columns or a column set column defaults security to 
the parent. 

Column set security can be a bit confusing. But 
after you play around with it, the choices Microsoft 
has made will make sense. 
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Name BirthDate 
Raja 2004-08-30 NULL 


<BirthD ate> 2004-08-30</BirthD ate» <wWeight> 53</Weight> 


Figure 4 


Results of using a column set with INSERT to update a record 
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DogID Name BithDate ‘Weight Leader Characteristics 


Raja NULL 53 1 <Weighb 53</Weighb <Leadem 1</Leaden 


Figure 5 


Results of using a column set with an UPDATE statement to update a record 


INSERT INTO DogCS ([Name], [LastRabies], [SledDog], [Rescue], [SpayNeuter], 
Characteristics) 

VALUES ('Star', "11/5/2088", 9, 9, 1, 

'<Handedness/><BirthDate/><Bi rthYear/><DeathDate/><Weight/><Leader/><OnGoingDrugs/>'); 


Filtered Indexes 

Filtered indexes are a new feature in SQL Server 2008 

that you can use with sparse columns—although you 

might find them helpful even if a table doesn’t have 
sparse columns. A filtered index is an index with 

a WHERE clause that lets you index a subset of 

the rows in a table, which is very useful with sparse 

columns because most of the data in the column is 
null. You can create a filtered index on the sparse 
column that includes only the non-null data. Doing 
so optimizes the use of tables with sparse columns. 
As with other indexes, the query optimizer uses a 
filtered index only if the execution plan the optimizer 
develops for a query will be more efficient with the 
index than without. The optimizer is more likely to 
use the index if the query’s WHERE clause matches 
that used to define the filtered index, but there are no 
guarantees. If it can use the filtered index, the query 
is likely to execute far more efficiently because SQL 

Server has to process less data to find the desired 

rows. There are three primary advantages to using 

filtered indexes: 

° They improve query performance, in part by 
enhancing the quality of the execution plan. 

° The maintenance costs for a filtered index are 
smaller because the index needs updating only 
when the data it covers changes, not for changes 
to every row in the table. 

° The index requires less storage space because it 
covers only a small set of rows. In fact, replacing 
a full-table, nonclustered index with multiple 
filtered indexes on different subsets of the data 
probably won’t increase the required storage 
space. 


In general, you should consider using a filtered 
index when the number of rows covered by the index 


LISTING 2: Inserting Data with Empty Tags and Selecting the Data 


SELECT [DogID], [Name] , [LastRabies] , [Sleddog] , [Handedness], [BirthDate] , [BirthYear], 
[DeathDate] , [Weight] , [Leader] , [Rescue] , [OnGoingDrugs] , [SpayNeuter] 
FROM DogCS WHERE Name = 'Star' 
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The following code creates a filtered index on the Useful Tools 
CurrencyRatelID field in the table, including all rows Sparse columns, column sets, and filtered indexes 


with a non-null value in the field: are useful tools that can make more efficient use 
of storage space on the server and improve the 

CREATE NONCLUSTERED INDEX performance of your databases and applications. 
fiSOHCurrencyNotNull ON But you'll need to understand how they work and 
Sales.SalesOrderHeader (CurrencyRateID) test carefully to make sure they work for your data 
WHERE CurrencyRateID IS NOT NULL; and with how your applications make use of the data. 
SOL} 

Next, we run the following SELECT statement: InstantDoc ID 104643 


SELECT SalesOrderID, CurrencyRateID, 
Comment FROM Sales.SalesOrderHeader 
WHERE CurrencyRateID = 6266; 


Figure 8 shows the estimated execution plan. You 
can see that the query is able to benefit from the ra n S O r mMm 
filtered index on the data. 


You're not limited to using filtered indexes 
with just the non-null data in a field. You can also o 
create an index on ranges of values. For example, l n t O a 
the following statement creates an index on just 
Data Quality Toolkit 
CREATE NONCLUSTERED INDEX 


the rows in the table where the SubTotal field is 
greater than $20,000: 
f1SOHSubTotal0ver20000 Plug these data quality transforms 
ON Sales.SalesOrderHeader (SubTotal) into your SSIS pipeline 
WHERE SubTotal > 20000; 


Here are a couple of queries that filter the data Profile your data 
based on the SubTotal field: j 


SELECT * FROM Sales.SalesOrderHeader Cleanse your data 
WHERE SubTotal > 30900; 


Parse & standardize your 


SELECT * FROM Sales.SalesOrderHeader contact data 


WHERE SubTotal > 21009 
AND SubTotal < 23000; 


Match/dedupe records 
across data sets 


Figure 9 shows the estimated execution plans 
for these two SELECT statements. Notice that 
the first statement doesn’t use the index, but Enrich your data 
the second one does. The first query probably 
returns too many rows (1,364) to make the use 
of the index worthwhile. But the second query Monitor your data ir 
returns a much smaller set of rows (44), so the 
query is apparently more efficient when using the 
index. 

You probably won’t want to use filtered Ask about our Guaranteed ROI program. 
indexes all the time, but if you have one of the 
scenarios that fit, they can make a big difference 
in performance. As with all indexes, you'll need 
to take a careful look at the nature of your data View a DEMO at 


and the patterns of usage in your applications. MelissaData.com/transform MELISSA DATA 
Yet again, SQL Server mirrors life: it’s full of or call 1-800-MELISSA Your Partner in Data Quality 
tradeoffs. 
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Useful Tools 
Sparse columns, column sets, and filtered indexes 


The following code creates a filtered index on the 
CurrencyRateID field in the table, including all rows 


with a non-null value in the field: 


CREATE NONCLUSTERED INDEX 
fiSOHCurrencyNotNull ON 
Sales.SalesOrderHeader (CurrencyRateID) 
WHERE CurrencyRateID IS NOT NULL; 


Next, we run the following SELECT statement: 


SELECT SalesOrderID, CurrencyRateID, 
Comment FROM Sales.SalesOrderHeader 
WHERE CurrencyRateID = 6266; 


Figure 8 shows the estimated execution plan. You 
can see that the query is able to benefit from the 
filtered index on the data. 

You're not limited to using filtered indexes 
with just the non-null data in a field. You can also 
create an index on ranges of values. For example, 
the following statement creates an index on just 
the rows in the table where the SubTotal field is 
greater than $20,000: 


CREATE NONCLUSTERED INDEX 
fi SOHSubTotal0ver29000 
ON Sales.SalesOrderHeader (SubTotal) 
WHERE SubTotal > 20000; 


Here are a couple of queries that filter the data 
based on the SubTotal field: 


SELECT * FROM Sales.SalesOrderHeader 
WHERE SubTotal > 30000; 


SELECT * FROM Sales.SalesOrderHeader 
WHERE SubTotal > 21000 
AND SubTotal < 23000; 


Figure 9 shows the estimated execution plans 
for these two SELECT statements. Notice that 
the first statement doesn’t use the index, but 
the second one does. The first query probably 
returns too many rows (1,364) to make the use 
of the index worthwhile. But the second query 
returns a much smaller set of rows (44), so the 
query is apparently more efficient when using the 
index. 

You probably won’t want to use filtered 
indexes all the time, but if you have one of the 
scenarios that fit, they can make a big difference 
in performance. As with all indexes, you'll need 
to take a careful look at the nature of your data 
and the patterns of usage in your applications. 
Yet again, SQL Server mirrors life: it’s full of 
tradeoffs. 
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are useful tools that can make more efficient use 
of storage space on the server and improve the 
performance of your databases and applications. 
But you'll need to understand how they work and 
test carefully to make sure they work for your data 
and with how your applications make use of the data. 
SOU 

InstantDoc ID 104643 


Transform 


SSIS into a 
Data Quality Toolkit 


Plug these data quality transforms 
into your SSIS pipeline 


Profile your data 


Cleanse your data 


Parse & standardize your 
contact data 


Match/dedupe records 
across data sets 


Enrich your data 


Monitor your data | 
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Z Results | Là Messages! 


Sleddog Handedness  BirthDate BirthYear 
2008-11-05 0 1900-01-01 0 


DogIlD Name LastRabies 


: Star 


DeathDate 
1900-01-01 0 0 0 1 


Weight Leader Rescue OnGoingDrugs - SpayNeuter 


Figure 6 


Results of inserting data with empty tags in a column set 


a Results | Gà Messages | 3 Execution plan | 
Query 1: 


100% 


5 


Clustered Index Scan (Clustered) 
[DogCS] . [PR_DogCS__46F46EC907020F2.. 
Cost: 100 + 


Query cost (relative to the batch): 


SELECT * FROM [DogCS] WHERE [Weight]>@1 


E 


A 


- Compute Scalar j 
Cost: 0 $ 


Figure 7 


Estimated execution plan for a query on a small table with sparse columns and a 
filtered index 


[ji 5 
Z Results | L} Messages) à` Execution plan | 
Query 1: Query cost (relative to the batch): 100% 


SELECT [SalesOrderID], [CurrencyRateID] , [Comment] FROM [Sales]. [Salesory 


él az 


Nested Loops ^ 7 Index Seek (NonClustered) 
(Imer Join) [SalesOrderHeader]. [ fiSOHCurrencyNo... 
Cost: 0% Cost: 33 % 


= Key Lookup (Clustered) 
[SalesOrderHeader]. [PK_SalesOrderHe... 
Cost: 67 % 


Figure 8 


Execution plan for query that uses a filtered index 


where the table includes a small number of non- 
null values in the field 

° for columns that represent categories of data, 
such as a part category key 

° for columns with distinct ranges of values, such as 
dates broken down into months, quarters, or years 


The WHERE clause in a filtered index definition 
can use only simple comparisons. This means that 
it can’t reference computed, user data type, spatial, 
or hierarchyid fields. You also can’t use LIKE for 
pattern matches, although you can use IN, IS, and IS 
NOT. So there are some limitations on the WHERE 
clause. 


Working with Filtered Indexes 
Let’s look at an example of using a filtered index. 
The following statement creates a filtered index on 
the DogCS table: 


CREATE INDEX fiDogCSWeight 
ON DogCS([Weight]) 
WHERE [Weight] > 40; 


This filtered index indexes only the 


E Results | [a Messages)” Execution pln | 


rows where the dog’s weight is greater 


Query 1: Query cost (relative to the batch): 80% 
SELECT * FROM [Sales]. [SalesOrderHeader] WHERE [SubTotal]>@1 


than 40 pounds, which automatically 


š — — Clustered Index Scan (Clustered) 
SELECT Filter Compute Scalar Compute Scalar 
Cost: 0 $ Cost: 3% Cost: 14 Cost: 1-4 [Sales0rderHeader]. [PK_SalesOrderHe.. 


Cost: 96 $ 


excludes all the rows with a null value 
in that field. After you create the index, 
you can execute the following state- 
ment to attempt to use the index: 


Query 2: Query cost (relative to the batch): 20% 
SELECT * FROM [Sales] . [SalesOrderHeader] WHERE [SubTotal]>@1 AND [SubTotal] <@2 


E E i g 


SELECT pati ai — Compute Scalar 
Cost: 0 $ pais git Cost: 0% 


Cost: 0 ¢ 
E] tal 


Compute Scalar 
Cost: 0% 


Compute Scalar 
Cost: 0 4 


Cost: 98 $ 


Key Lookup (Clustered) 
[SalesOrderHeader]. [PK_SalesOrderHe.. 


SELECT * FROM DogCS WHERE 
[Weight] > 40 


If you examine the estimated or 
actual execution plan for this SELECT 
statement—which Figure 7 shows— 


Figure 9 


Estimated execution plans for two queries on a field with a filtered index 


is small in relation to the number of rows in the 
table. As that ratio grows, it starts to become better 
to use a standard, nonfiltered index. Use 50 percent 
as an initial rule of thumb, but test the performance 
benefits with your data. Use filtered indexes in these 
kinds of scenarios: 
° for columns where most of the data in a field is 
null, such as with sparse columns; in other words, 
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you'll find that SQL Server doesn’t use 

the index. Instead, it uses a clustered 

index scan on the primary key to visit 

each row in the table. There are too few 
rows in the DogCS table to make the use of the index 
beneficial to the query. 

To explore the benefits of using filtered indexes, 
let’s look at an example using the Adventure- 
Works2008 Sales.SalesOrderHeader table. It has a 
CurrencyRateID field that has very few values. This 
field isn’t defined as a sparse column, but it might be 
a good candidate for making sparse. 
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Useful Tools 
Sparse columns, column sets, and filtered indexes 


The following code creates a filtered index on the 
CurrencyRateID field in the table, including all rows 


with a non-null value in the field: 


CREATE NONCLUSTERED INDEX 
fiSOHCurrencyNotNull ON 
Sales.SalesOrderHeader (CurrencyRateID) 
WHERE CurrencyRateID IS NOT NULL; 


Next, we run the following SELECT statement: 


SELECT SalesOrderID, CurrencyRateID, 
Comment FROM Sales.SalesOrderHeader 
WHERE CurrencyRateID = 6266; 


Figure 8 shows the estimated execution plan. You 
can see that the query is able to benefit from the 
filtered index on the data. 

You're not limited to using filtered indexes 
with just the non-null data in a field. You can also 
create an index on ranges of values. For example, 
the following statement creates an index on just 
the rows in the table where the SubTotal field is 
greater than $20,000: 


CREATE NONCLUSTERED INDEX 
fi SOHSubTotal0ver29000 
ON Sales.SalesOrderHeader (SubTotal) 
WHERE SubTotal > 20000; 


Here are a couple of queries that filter the data 
based on the SubTotal field: 


SELECT * FROM Sales.SalesOrderHeader 
WHERE SubTotal > 30000; 


SELECT * FROM Sales.SalesOrderHeader 
WHERE SubTotal > 21009 
AND SubTotal < 23000; 


Figure 9 shows the estimated execution plans 
for these two SELECT statements. Notice that 
the first statement doesn’t use the index, but 
the second one does. The first query probably 
returns too many rows (1,364) to make the use 
of the index worthwhile. But the second query 
returns a much smaller set of rows (44), so the 
query is apparently more efficient when using the 
index. 

You probably won’t want to use filtered 
indexes all the time, but if you have one of the 
scenarios that fit, they can make a big difference 
in performance. As with all indexes, you'll need 
to take a careful look at the nature of your data 
and the patterns of usage in your applications. 
Yet again, SQL Server mirrors life: it’s full of 
tradeoffs. 
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are useful tools that can make more efficient use 
of storage space on the server and improve the 
performance of your databases and applications. 
But you'll need to understand how they work and 
test carefully to make sure they work for your data 
and with how your applications make use of the data. 
SQL 
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Ils Your Physical Disk I/O 


Affectin 


Performance? 


Here are three ways to find out 


s a performance consultant, I see SQL 

Server installations that range in size from 

the tiny to the largest in the world. But 
one of the most elusive aspects of today’s SQL Server 
installations is—without a doubt—a properly sized 
or configured storage subsystem. There are many 
reasons why this is the case, as you'll see. 

My goal is to help you gather and analyze the 
necessary data to determine whether your storage 
is keeping up with the demand of your workloads. 
That goal is easier stated than accomplished, because 
every system has unique objectives and thresholds of 
what’s acceptable. My plan is not to teach you how 
to configure your storage; that’s beyond the scope of 
this article. Rather, I want to show you how you can 
use a fairly common set of industry guidelines to get 
an initial idea about acceptable storage performance 
related to SQL Server. 


Three Reasons 

The most common reason for an improperly sized 
or configured storage subsystem is simply that 
insufficient money is budgeted for the storage from 
the start. Performance comes from having lots of 
disks, which tends to give an excess of free disk space. 
Suffice it to say, we worry too much about the presence 
of wasted free disk space and not enough about the 
performance aspect. Unfortunately, once a system is 
in production, changing the storage configuration is 
difficult. Many installations fall into this trap. 

The second most common reason is that many 
SQL Server instances use storage on a SAN that’s 
improperly configured for their workload and that 
shares resources with other I/O-intensive applica- 
tions—particularly true when the shared resources 
include the physical disks themselves. Such a scenario 
not only affects performance but also can lead to 
unpredictable results as the other applications’ work- 
load changes. 

The third most common reason is that many 
DBAs simply aren’t as familiar with hardware-related 
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concerns as they are with SQL Server. This imbal- 
ance can lead to less-than-desirable decisions when it 
comes to properly sizing the storage. 


Three Sources of Help 

Before I dive into acceptable limits, I should point 
out one important aspect of evaluating I/O response 
times: Just because you deem I/O response times 
unacceptable doesn’t automatically mean that you 
need to adjust your I/O capacity. Often, poor response 
times mean that you’re simply doing too much phys- 
ical I/O or that you're doing it inefficiently. If you're 
scanning tables or indexes when you should be doing 
seeks instead, you'll ultimately induce more physical 
T/O than you might otherwise incur. When it comes 
to efficient I/O processing, never underestimate the 
importance of a well tuned and optimized database 
and associated queries. A little tuning effort can usu- 
ally avoid a costly and often difficult upgrade of the 
storage subsystem. 

When it comes to defining I/O response times 
related to SQL Server, there are three primary 
sources of data that you should watch regularly. 
You obtain two of these—the virtual file and wait 
stats—directly from SQL Server, and you get the third 
via the Windows Performance Monitor counters. 
Individually and especially together, these perfor- 
mance metrics can shed a lot of light on how well 
your disk subsystem is handling your requests for 
physical I/O. You just need to know what to look for 
and where to find it. 

Physical I/O response time is the amount of time 
(in seconds) that an average read or write request to a 
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physical disk takes to complete. You can get this data 
easily by using Performance Monitor’s LogicalDisk 
or Physical Disk counters for Avg. Disk sec/Read 
and Avg. Disk sec/Write. Most experts would agree 
that with today’s technology, the response times when 
using 15,000rpm disks should be in the range that 
Figure | shows. 

Keep in mind that these numbers are what most 
systems can expect to achieve with a properly sized 
and configured storage subsystem. If you’re averaging 
slower response times, you probably have room for 
improvement. But remember that what’s deemed 
unacceptable for one system might be perfectly fine 
for another system, due to the individual requirements. 
You'll need to make that determination on a case-by- 
case basis, but you can use these numbers as guidelines 
so that you have something to shoot for. 

Another approach is to keep an eye on the PAGE- 
IOLATCH_XX and WRITELOG waits. These stats 
will also tell you a lot about how well you're keeping 
up with the data and log I/O requests. High average 
waits on either of these stats should spark interest 
in terms of physical I/O response times. To get up 
to speed on exactly what wait stats are and how to 


Audit Database changes, 


Read the SQL Transaction Log 


collect the information, refer to my articles “Getting 
to Know Wait Stats” (InstantDoc ID 96746) and 
“Dissecting SQL Server’s Top Waits” (InstantDoc 
ID 98112). 

You can also obtain average physical I/O read and 
write response times by using the virtual file statistics 
as outlined in “Getting to Know Virtual File Stats” 
(InstantDoc ID 96513). The file stats DMV also tells 
you the amount of data read or written—which can 
be extremely helpful in determining whether you're 
accessing too much data in the first place, as previously 
mentioned. 


It’s a Start 
Utilize these provided response times as an initial 
guideline. If you can meet these specs, you shouldn't 
have to worry about physical I/O as a bottleneck. Again, 
not every situation calls for such good response times 
and you need to adapt accordingly for each system. 
However, I’ve found that many people don’t know what 
to shoot for in the first place! Hopefully, this article 
provides a good start so that you can concentrate on a 
firm objective in terms of I/O response time. SQL 
InstantDoc ID 103703 
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and server components that enable managed 

self-service business intelligence (BI) for the 
next-generation Microsoft data platform. After I intro- 
duce you to PowerPivot, PI walk you through how 
to use it from the client (i.e., information producer) 
perspective. 


Ss: Server PowerPivot is a collection of client 


PowerPivot 

PowerPivot is a managed self-service BI product. In 
other words, it’s a software tool that empowers infor- 
mation producers (think business analysts) so that they 
can create their own BI applications without having 
to rely on traditional IT resources. If you're thinking 
“but IT needs control of our data,” don’t worry. IT 
will have control of the data PowerPivot consumes. 
Once the data is imported into PowerPivot, it becomes 
read only. IT will also have control of user access and 
data refreshing. Many of PowerPivot’s most compel- 
ling features require the use of a centralized server for 
distribution, which pushes the product’s usage toward 
its intended direction of complementing rather than 
competing against traditional BI solutions. This man- 
aged self-service BI tool effectively balances informa- 
tion producers’ need for ad-hoc analytical information 
with IT’s need for control and administration. 

Key features such as the PowerPivot Manage- 
ment Dashboard and data refresh process are why 
PowerPivot is categorized as a managed self-service BI 
product. Other key features include: 

e The use of Microsoft Excel on the client side. Most 
business users are already familiar with how to 
use Excel. By incorporating this new self-service 
BI tool into Excel, Microsoft is making further 
headway in its goal of bringing BI to the masses. 

è Excel’s PivotTable capabilities have improved 
over the years. For example, Excel 2010 includes 
such enhancements as slicers, which let you inter- 
actively filter data. PowerPivot further enhances 
Excel 2010’s native PivotTable capabilities. The 
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enhancements include Data Analysis Expressions 
(DAX) support, relationships between tables, and 
the ability to use a slicer on multiple PivotTables. 

e A wide range of supported data sources. PowerPivot 
supports mainstream Microsoft data repositories, 
such as SQL Server, SQL Server Analysis Services 
(SSAS), and Access. It also supports nontraditional 
data sources, such as SQL Server Reporting Services 
(SSRS) 2008 R2 reports and other Atom 1.0 data 
feeds. In addition, you can connect to any ODBC or 
OLE DB compliant data source. 

è Excellent data-processing performance. PowerPivot 
can process, sort, filter, and pivot on massive data 
volumes. The product makes extremely good use 
of x64 technology, multicore processors, and giga- 
bytes of memory on the desktop. 


PowerPivot consists of four key components. 
The client-side components are Excel 2010 and the 
PowerPivot for Excel add-on. Information 
producers who will be developing PowerPivot 
applications will need Excel 2010. Users of 
those applications can be running earlier 
versions of Excel. Under the hood, the 
add-on is the SSAS 2008 R2 engine running 
as an in-memory DLL, which is referred to as the 
VertiPaq mode. When information producers create 
data sources, define relationships, and so on in the 
PowerPivot window inside of Excel 2010, they’re 
implicitly building SSAS databases. 

The server-side components are SharePoint 2010 and 
SSAS 2008 R2 running in VertiPaq mode. SharePoint 
provides the accessibility and managed aspects of Power- 
Pivot. SSAS 2008 R2 is used for hosting, querying, and 
processing the applications built with PowerPivot after 
the user has deployed the workbooks to SharePoint. 


What You Will Need 
For this demonstration, you don’t need access to Share- 
Point 2010, as PII focus on the client-side experience. 
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and solutions. He’s a SQL Server MVP and holds 
several Microsoft certifications. 


Download the code at 
InstantDoc ID 104646. 
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What you will need is a machine or virtual machine 

(VM) with the following installed: 

© Excel 2010. The beta version is available at www 
-microsoft.com/office/2010/en/default.aspx. 

è PowerPivot for Excel add-on, which you can 
obtain at www.powerpivot.com. The add-on’s 
CPU architecture must match Excel 2010’s CPU 
architecture. 

° The relational database engine and SSRS in SQL 
Server 2008 R2. You can download the SQL 
Server 2008 R2 November Community Tech- 
nology Preview (CTP) at www.microsoft.com/ 
sqlserver/2008/en/us/R2Downloads.aspx. Note 
that if you’re deploying SSRS on an all-in-one 
machine running Windows Server 2008 or Win- 
dows Vista, you need to follow the guidelines 
in the Microsoft article at msdn.microsoft.com/ 
en-us/library/bb630430(SQL.105).aspx. 

© SQL Server 2008 R2 AdventureWorks 
sample databases, which you can get at 
msftdbprodsamples.codeplex.com. 


The PowerPivot for Excel add-on and Adventure- 
Works downloads are free. As of this writing, the 
Excel 2010 beta and SQL Server 2008 R2 November 
CTP are also free because they have yet to be com- 
mercially released. 

Finally, you need to download several T-SQL 
scripts and a sample Excel workbook from the 
SQL Server Magazine website. Go to www.sqlmag 
.com, enter 104646 in the InstantDoc ID box, click Go, 
then click Download the Code Here. Save the 104646 
.zip file on your machine. 


The Scenario 

The exercise I’m about to walk you through is 
based on the following fictitious business scenario: 
AdventureWorks has become a dominant force in 
its respective industries and wants to ensure that 
it retains its most valuable resource: its employees. 
As a result, AdventureWorks has conducted its first 
internal employee morale survey. In addition, the 
company has started tracking employee promo- 
tions. Management wants to analyze the results of 
the employee survey by coupling those results with 
the employee promotion data and core employee 
information (e.g., employees’ hire dates, their depart- 
ments). In the future, management would also like 
to incorporate outside data (e.g., HR industry data) 
into the analysis. 

The AdventureWorks BI team is too overwhelmed 
with resolving ongoing data quality issues and model 
modifications to accommodate such an ad-hoc 
analysis request. As a result, management has asked 
you to develop the AdventureWorks Employee 
Morale PowerPivot Application. To construct this 
application, you need the core employee information, 


the employee promotions data, and the employee 

morale survey results. Although all the data resides 

behind the corporate firewall, it’s scattered in different 
repositories: 

e The core employee information is in the 
DimEmployee table in the AdventureWorks DW 
(short for data warehouse) for SQL Server 2008 
R2 (AdventureWorksDW2008R2). 

è The list of employee promotions is in an SSRS 
report. 

è The results of the employee morale survey are in 
an Excel 2010 worksheet named AW_ 
EmployeeMoraleSurveyResults.xlsx. 


To centralize this data and create the application, you 
need to follow these steps: 

1. Create and populate the FactEmployeePromo- 
tions table. 

2. Use SSRS 2008 R2 to create and deploy the 
employee promotions report. 

3. Import the results of the employee morale 
survey into PowerPivot. 

4. Import the core employee data into 
PowerPivot. 

5. Import the employee promotions report into 
PowerPivot. 

6. Establish relationships between the three sets of 
imported data. 

7. Use DAX to add a calculated column. 

8. Hide unneeded columns. 

9. Create the PivotCharts. 


Step | 

In AdventureWorks DW, you need to create a new fact 
table that contains the employee promotions data. To 
do so, start a new instance of SQL Server Management 
Studio (SSMS) by clicking Start, All Programs, Micro- 
soft SQL Server 2008 R2 November CTP, SQL Server 
Management Studio. In the Connection dialog box, 
connect to a SQL Server 2008 R2 relational database 
engine that has the AdventureWorksDW2008R2 data- 
base pre-installed. Click the New Query button to open 
a new relational query window. Copy the contents of 
Create_FactEmployeePromotions.sql (which is in the 
104646.zip file) into the query window. Execute the 
script by pressing the F5 key. 

Once the FactEmployeePromotions table has been 
created, you need to populate it. Assuming you still have 
SSMS open, press Ctrl+Alt+Del to delete the existing 
script, then copy thecontents of Populate_FactEmployee 
Promotions.sql (which is in 104646.zip) into the query 
window. Execute the script by pressing the F5 key. 


Step 2 

With the FactEmployeePromotions table created 
and populated, the next step is to create and deploy 
the employee promotions report in SSRS 2008 R2. 
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First, open up a new instance of the Business Intel- 
ligence Development Studio (BIDS) by clicking 
Start, All Programs, Microsoft SQL Server 2008 R2 
November CTP, SQL Server Business Intelligence 
Development Studio. 

In the BIDS Start page, select File, New, Project 
to bring up the New Project dialog box. Choose 
the Report Server Project Wizard option, type ssrs_ 
AWDW for the project’s name, and click OK. In the 
Report Wizard’s Welcome page, click Next. 

In the Select the Data Source page, you need to 
create a new relational database connection to the same 
AdventureWorks DW database used in step 1. After 
you defined the new data source, click Next. 

In the Design the Query page, copy the contents of 
Employee_Promotions_Report_Source.sql (which is in 
104646.zip) into the Query string text box. Click Next 
to bring up the Select the Report Type page. Choose 
the Tabular option, and click Next. 

In the Design the Table page, select all three avail- 
able fields and click the Details button to add the 
fields to the details section of the report. Click Next to 
advance to the Choose the Table Style page, where you 
should leave the default settings. Click Next. 

At this point, you'll see the Choose the Deployment 
Location page shown in Figure 1. In this page, which 
is new to SQL Server 2008 R2, you need to make sure 
the correct report server URL is specified in the Report 
server text box. Leave the default values in the Deploy- 
ment folder text box and the Report server version drop- 
down list. Click Next. 

In the Completing the Wizard page, type 
EmployeePromotions for the report’s name. Click 
Finish to complete the report wizard process. 

Next, you need to deploy the finished report to 
the target SSRS report server. Assuming you specified 
the correct URL for the report server in the Choose 
the Deployment Location page, you simply need 
to deploy the report. If you need to alter the target 
report server settings, you can do so by right-clicking 
the project in Solution Explorer and selecting the 
Properties option. Update your target report server 
setting and click OK to close the SSRS project’s 
Properties dialog box. 

To deploy the report, select Build, then Deploy 
ssrs_AWDW from the main Visual Studio 2008 menu. 
To confirm that your report deployed correctly, review 
the text that’s displayed in the Output window. 


Step 3 

With the employee promotions report deployed, 
you can now turn your attention to constructing the 
PowerPivot application, which involves importing all 
three data sets into PowerPivot. Because the employee 
morale survey results are contained within an existing 
Excel 2010 worksheet, it makes sense to first import 
those results into PowerPivot. 
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Open the AW_EmployeeMoraleSurveyResults.xlsx 
file, which you can find in 104646.zip, and go to the 
Survey Results tab. As Figure 2 shows, the survey results 
are in a basic Excel table with three detailed values 
(Position, Compensation, and Management) and one 
overall self-ranking value (Overall). The scale of the 
rankings is 1 to 10, with 10 being the best value. To tie 
the survey results back to a particular employee, the 
worksheet includes the employee’s internal key as well. 

With the survey results open in Excel 2010, you 
need to import the table’s data into PowerPivot. 
One way to import data is to use the Linked Tables 
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Specifying where you want to deploy the EmployeePromotions report 
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Figure 3 
The PowerPivot ribbon in Excel 2010 
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feature, which links data from a parent worksheet. 
You access this feature from the PowerPivot ribbon, 
which Figure 3 shows. Before I explain how to 
use the Linked Tables feature, however, I want to 
point out two other items in the PowerPivot ribbon 
because they’re crucial for anyone using PowerPivot. 
On the left end of the ribbon, notice the Power Pivot 
window button. The PowerPivot window is the main 
UI you use to construct a PowerPivot model. Once 
the model has been constructed, you can then pivot 
on it, using a variety of supported PivotTables and 
PivotCharts in the Excel 2010 worksheet. Collec- 
tively, these components form what is known as a 
PowerPivot application. To launch the PowerPivot 
window, you can either click the Power Pivot window 
button or invoke some other action that automati- 
cally launches the PowerPivot window with imported 
data. The Linked Tables feature performs the latter 
behavior. 

The other item you need to be familiar with in 
the PowerPivot ribbon is the Options & Diagnos- 
tics button. By clicking it, you gain access to key 
settings, including options for tracing PowerPivot 
activity, recording client environment specifics with 
snapshots, and participating in the Microsoft Cus- 
tomer Experience Improvement Program. 

Now, let’s get back to creating the Adventure- 
Works Employee Morale PowerPivot Application. 
First, click the PowerPivot tab or press Alt+G to 
bring up the PowerPivot ribbon. In the ribbon, click 
the Create Linked Table button. The PowerPivot 
window shown in Figure 4 should appear. Notice 
that the survey results have already been imported 
into it. At the bottom of the window, right-click the 
tab labeled Table 1, select the Rename option, type 
Survey Results, and press Enter. 


Step 4 

The next step in constructing the PowerPivot applica- 
tion is to import the core employee data. To do so, 
click the From Database button in the PowerPivot 
window, then select the From SQL Server option. 
You should now see the Table Import Wizard’s 
Connect to a Microsoft SQL Server Database page, 
which Figure 5 shows. In this page, replace Sq/Server 
with Core Employee Data in the Friendly connec- 
tion name text box. In the Server name drop-down 
list, select the SQL Server instance that contains 
the AdventureWorks DW database. Next, select the 
database labeled AdventureWorksDW2008r2 in the 
Database name drop-down list, then click the Test 
Connection button. If the database connection is 
successful, click Next. 

In the page that appears, leave the default option 
selected and click Next. You should now see a list of 
the tables in AdventureWorks DW. Select the check 
box next to the table labeled DimEmployee and 
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click the Preview & Filter button in the lower right 
corner. 

In the Preview Selected Table page, drag the hori- 
zontal scroll bar as far as possible to the right, then 
click the down arrow to the right of the “Status” 
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From From From Data Refresh Existing z PivotTable Switch to 
Database” Files” Feeds ~ Connections Workbook 
Get External Data Connections Paste from Clipboard Calculation View 
[EmployeeKey] ~ ji 
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Compensation [J/Overall B| Management BEUC T 
z 2 2 


2 
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3 5 3 4 5 
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Figure 4 


The PowerPivot window 


=e a 
Table Import Wizard 


Connect toa Microsoft SQL Server Database 
Enter the information required to connect to the SQL Server database. 


Friendly connection name: SqlServer 
Server name: + 
Log on to the server 


© Use Windows Authentication 


© Use SQL Server Authentication 
User name: 
Password: 
Database name: v 
| Advanced | | Test Connection | 
Figure 5 


Connecting to AdventureWorks DW to import the core employee data 
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Figure 6 
Previewing the filtered core employee data 
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| Table Import Wizard -ee 
Connectto a Data Feed 
Enter the information required to connect to a data feed. 
Friendly connection 
renner DataFeed 
Report Path: Browse. 
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| Advanced | l Test Connection 
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Figure 7 


Connecting to the SSRS report to import the list of employee promotions 
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Hiding unneeded columns 
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column header. In the Text Filters box, clear the 
(Select All) check box and select the Current check 
box. This action will restrict the employee records to 
only those that are active. Figure 6 shows a preview 
of the filtered data set. 

With the employee data filtered, go ahead and 
click OK. Click Finish to begin importing the 
employee records. In the final Table Import Wizard 
page, which shows the status of the data being 
imported, click Close to complete the database import 
process and return to the PowerPivot window. 


Step 5 
In Step 2, you built an SSRS 2008 R2 report that 
contains employee promotion data. You now need 
to import this report’s result set into the PowerPivot 
application. To do so, click From Data Feeds on the 
Home tab, then select the From Reporting Services 
option. You should see the Table Import Wizard’s 
Connect to a Data Feed page, which Figure 7 shows. 
In the Friendly connection name text box, replace 
DataFeed with Employee Promotions and click the 
Browse button. In the Browse dialog box, navigate to 
your SSRS report server and select the recently built 
EmployeePromotions.rdl report. When you see a pre- 
view of the report, click the Test Connection button. 
If the data feed connection is successful, click Next. 
In the page that appears, you should see that an 
existing report region labeled table] is checked. To 
change that name, type EmployeePromotions in the 
Friendly Name text box. Click Finish to import the 
report’s result set. When a page showing the data 
import status appears, click the Close button to return 
to the PowerPivot window. 


Step 6 

Before you can leverage the three data sets you 

imported into PowerPivot, you need to establish 

relationships between them, which is also known 

as correlation. To correlate the three data sets, click 

the Table tab in the PowerPivot window, then select 

Create Relationship. In the Create Relationship dialog 

box, select the following: 

e SurveyResults in the Table drop-down list 

° EmployeeKey in the Column drop-down list 

e DimEmployee in the Related Lookup Table drop- 
down list 

° EmployeeKey in the Related Lookup Column 
drop-down list 


Click the Create button to establish the designated 
relationship. 

Perform this process again (starting with clicking 
the Table tab), except select the following in the Create 
Relationship dialog box: 

e SurveyResults in the Table drop-down list 
° EmployeeKey in the Column drop-down list 
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¢ EmployeePromotions in the Related Lookup 
Table drop-down list 

¢ EmployeeKey in the Related Lookup Column 
drop-down list 


To confirm the creation of both relationships, 
select Manage Relationships on the Table tab. You 
should see two relationships listed in the Manage 
Relationships dialog box, as Figure 8 shows. 


Step 7 

Because PowerPivot builds an internal SSAS database, 
a hurdle that the Microsoft PowerPivot product team 
had to overcome was exposing MDXs in an Excel-like 
method. Enter DAX. With DAX, you can harness 
the power of a large portion of the MDX language 
while using Excel-like formulas. You leverage DAX as 
calculated columns in PowerPivot’s imported data sets 
or as dynamic measures in the resulting PivotTables 
and PivotCharts. 

In this example, let’s add a calculated column that 
specifies the year in which each employee was hired. 
Select the Dim-Employee table in the PowerPivot 
window. In the Column tab, click Add Column. A 
new column will be appended to the table’s columns. 
Click the formula bar located directly above the new 
column, type = YEAR(/HireDate]), and press Enter. 
Each employee’s year of hire value now appears 
inside the calculated column. To rename the calcu- 
lated column, right-click the calculated column’s 
header, select Rename, type YearOfHire, and press 
Enter. 

Don’t let the simplicity of this DAX formula fool 
you. The language is extremely powerful and capable 
of supporting capabilities such as cross-table lookups 
and parallel time-period functions. 


Step 8 

Although the data tables’ current schema fulfills 
the scenario’s analytical requirements, you can go 
the extra mile by hiding unneeded columns. (In the 
real-world, a better approach would be to not import 
the unneeded columns for file size and performance 
reasons, but I want to show you how to hide columns.) 
Hiding columns is useful when you think a column 
might be of some value later or when calculated col- 
umns aren't currently being used. 

Select the SurveyResults table by clicking its 
corresponding tab at the bottom of the Power- 
Pivot window. Next, click the Hide and Unhide 
button on the Column ribbon menu. In the Hide 
and Unhide Columns dialog box, which Figure 9 
shows, clear the check boxes for EmployeeKey 
under the In Gemini and In Pivot Table columns. 
(Gemini was the code-name for PowerPivot. This 
label might change in future releases.) Click OK. 
Following the same procedure, hide the remaining 
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unneeded columns 
listed in Table 1. 


Step 9 
In the PowerPivot 
window, click the Pivot- 
Table button on the 
Home tab, then select 
Four Charts. In the Insert 
Pivot dialog box, leave the 
New Worksheet option 
selected and click OK. 
To build the first of 
the four pivot charts, 
Overall Morale by 
Department, select 
the upper left chart 
with your cursor. In 
the Gemini Task Pane, 
which Figure 10 shows, 
select the Overall check 
box under the Survey- 
Results table. (Note 
that this pane’s label 
might change in future 
releases.) In the Values 
section of the Gemini 
Task Pane, you should 
now see Overall. Click 


> 


TABLE |: Remaining Columns that 
Need to Be Hidden 


DimEmployee 
DimEmployee 
DimEmployee 
DimEmployee 
DimEmployee 
DimEmployee 
DimEmployee 
DimEmployee 
DimEmployee 
DimEmployee 
DimEmployee 
DimEmployee 
DimEmployee 
DimEmployee 
DimEmployee 
DimEmployee 
DimEmployee 
DimEmployee 
DimEmployee 
DimEmployee 
EmployeePromotions 


EmployeeKey 
ParentEmployeeKey 
EmployeeNationallDAlternateKey 


ParentEmployeeNationallDAlternateKey 
SalesTerritory 
SalesTerritoryKey 
FirstName 

LastName 

MiddleName 

NameStyle 

HireDate 

BirthDate 

LoginID 

EmailAddress 

Phone 

Marital Status 
EmergencyContactName 
EmergencyContactPhone 
StartDate 

EndDate 

EmployeeKey 


Gemini Task Pane 
Choosefields to add to report: 


Search 


E 
] Position 
| Compensation 
Management 
Overall 
El DimEmployee 
Title 
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YearOfHire 
E EmployeePromotions 
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j Slicers Horizontal 


j Legend Fields (Series) 
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Figure 10 


Using the Gemini Task Pane to create a PivotChart 
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PivotChart showing overall morale by department 
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promoted. Or you can refine your 

analysis by selecting a specific year 
of hire or by focusing on certain departments. 
Keeping this in mind, create the three remaining 
PivotCharts using any combination of attributes 
and morale measures. (For more information 
about how to create PowerPivot PivotCharts, see 
www.pivot-table.com.) 

After all four PivotCharts are created, you 
can clean the application up for presentation. 
Select the Page Layout Excel ribbon and clear 
the Gridlines View and Headings View check 
boxes in the Sheet Options group. Next, expand 
the Themes drop-down list in the Page Layout 
ribbon and select a theme of your personal 
liking. You can remove a PivotChart’s legend by 
right-clicking it and selecting Delete. To change 
the chart’s title, simply click inside it and type 
the desired title. 


Want to Learn More? 
After providing you with a quick overview of 
PowerPivot, I concentrated on the client experience 
by discussing how to create a PowerPivot 
application that leverages a variety of data 
sources. If you want to learn about the PowerPivot 
server-side experience or DAX, there are several 
helpful websites. Besides Microsoft’s PowerPivot 
site (powerpivot.com), you can check out 
PowerPivotPro (powerpivotpro.com), PowerPivot 
Twin (powerpivottwins.com), Great PowerPivot 
FAQ (powerpivotfaq.com/Lists/TGPPF/AllItems 
.aspx), or PowerPivot and DAX Information 
Hub (powerpivot-info.com). SQL 
InstantDoc ID 104646 
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Improving the Performance of 


Distributed 


Queries 


Pitfalls to avoid and tweaks to try 


istributed queries involve data that’s 
D stored on another server, be it on a 

server in the same room or on a machine 
a half a world away. Writing a distributed query 
is easy—so easy, in fact, that you might not even 
consider them any different from a run-of-the-mill 
local query—but there are many potential pitfalls 
waiting for the unsuspecting DBA to fall into. This 
isn’t surprising given that little information about 
distributed queries is available online. There’s even 
less information available about how to improve 
their performance. 

To help fill this gap, I'll be taking an in-depth 
look into how SQL Server 2008 treats distrib- 
uted queries, specifically focusing on how queries 
involving two or more instances of SQL Server are 
handled at runtime. Performance is key in this situ- 
ation, as much of the need for distributed queries 
revolves around OLTP—solutions for reporting 
and analysis are much better handled with data 
warehousing. 

The three distributed queries I’ll be discussing 
are designed to be run against a database named 
SCRATCH that resides on a server named 
REMOTE. If you'd like to try these queries, I’ve 
written a script, CreateSCRATCHDatabase.sql, 
that creates the SCRATCH database. You can 
download this script as well as the code in the list- 
ings by going to www.sqlmag.com, entering 103688 
in the InstantDoc ID text box, and clicking the 
Download the Code Here button. After Create- 
SCRATCHDatabase.sq] runs (it'll take a few min- 
utes to complete), place the SCRATCH database 
on a server named REMOTE. This can be easily 
mimicked by creating an alias named REMOTE on 
your SQL Server machine in SQL Server Configu- 
ration Manager and having that alias point back to 
your machine, as Figure 1 shows. Having configured 
an alias in this way, you can then proceed to add 
REMOTE as a linked SQL Server on your machine 
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(in this example, MYSERVER). While somewhat 
artificial, this setup is perfectly adequate for 
observing distributed query performance. In addi- 
tion, some tasks, such as profiling the server, are 
simplified because all queries are being performed 
against a single instance. 


DistributedQuery I 
As distributed queries go, the simplest form is one 
that queries a single remote server, with no reli- 
ance on any local data. For example, suppose you 
run the query in Listing 1 (DistributedQuery1) in 
SQL Server Management Studio (SSMS) with the 
Include Actual Execution Plan option enabled. 
As far as the local server (which PI call 
LOCAL) is concerned, the entire state- 
ment is offloaded to the remote server, 
more or less untouched, with LOCAL 


-— — 
REMOTE Properties 
=—_ 

Alias oo O E 

|E General 
Alias Name REMOTE 
Port No 
Protocol TCP/IP 
Server MYSERVER| 

| Server 

| Name of server to which the alias points 

[mnan] 


Figure | 
Creation of the REMOTE alias 
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patiently waiting for the results to stream back. As 
Figure 2 shows, this is evident in the tooltip for the 
Remote Query node in the execution plan. Although 
the query was rewritten before it was submitted to 
REMOTE, semantically the original query and the 
rewritten query are the same. Note that the Compute 


LISTING |: DistributedQuery! (Queries a 
Remote Server) 


USE SCRATCH 
GO 


SELECT TOP 1000 * 
FROM REMOTE.Scratch.dbo.Tablel T1 
INNER JOIN REMOTE.Scratch.dbo.Table2 T2 
ON T1.1D 
= 121) 


REMOTE 

Remote Query 

SELECT TOP (1000) “Tb!1001"."1D" 

"Col1011","Tbl1001"."GUID" 

"Col1012","Tbl1001"."GUIDSTR" "Col1013","Tb11003"."ID" 

"Col1007","Tbl1003"."GUID" 

|. "“Coli008","Tbl1003"."GUIDSTR" "Col1009",? "Expr1004" 

K FROM "Scratch"."dbo".”"Table2" 
"Tbli003","Scratch"."dbo"."Table1" "Tbl1001" WHERE 

"Tbl1003"."ID"="Tbl1001"."ID" 


Figure 2 


Excerpt from the execution plan for DistributedQuery| 


LISTING 2: DistributedQuery2 (Queries the Local Server 
and a Remote Server) 


USE SCRATCH 
GO 


SELECT = 


FROM dbo.Tablel T1 
INNER JOIN REMOTE.Scratch.dbo.Table2 T2 


ON T1.ID 
= T2.ID 


WHERE T1.GUIDSTR < 'ØØ1' 


Scalar node that appears between the Remote Query 
and SELECT in the query plan merely serves to 
rename the fields. (Their names were obfuscated by 
the remote query.) 

LOCAL doesn’t have much influence over how 
DistributedQuery1 is optimized by REMOTE—nor 
would you want it to. Consider the case where 
REMOTE is an Oracle server. SQL Server thrives on 
creating query plans to run against its own data engine, 
but doesn’t stand a chance when it comes to rewriting 
a query that performs well with an Oracle database 
or any other data source. This is a fairly naive view of 
what happens because almost all data source providers 
don’t understand T-SQL, so a degree of translation 
is required. In addition, the use of T-SQL functions 
might result in a different approach being taken when 
querying a non-SQL Server machine. (Because my 
main focus is on queries between SQL Server instances, 
I won't go into any further details here.) 

So, what about the performance of single-server 
distributed queries? A distributed query that touches 
just one remote server will generally perform well if 
that query’s performance is good when you execute 
it directly on that server. 


DistributedQuery2 
In the real world, it’s much more likely that a distributed 
query is going to be combining data from more than 
one SQL Server instance. SQL Server now has a much 
more interesting challenge ahead of it—figuring out 
what data is required from each instance. Determining 
which columns are required from a particular instance 
is relatively straightforward. Determining which rows 
are needed is another matter entirely. 

Consider DistributedQuery2, which Listing 2 
shows. In this query, Table! from LOCAL is being 


Query 1: Query cost (relative to the batch): 100% 
SELECT * FROM dbo.Tablel T1 INNER JOIN REMOTE.SCRATCH.dbo.Table2 T2 ON T1.ID = T2.. 
E| ie] ic] a 
SELECT Nested Loops ——__ ‘Nested Loops £ Index Seek (NonClustered) 
tia: S (Inner Join) (Inner Join} [Table1]. [idx_2] [T1] 
: iiini Cost: 0 4% Cost: 0% Cost: 3% 
= Key Lookup (Clustered) 
[Tablel] .[PK_Tablel_ 3214EC277F€0E.. 
Cost: 21 $% 
E = 
— ; = 
Compute Scalar Remote Query 
Cost: 0% Cost: 76 & 
Figure 3 


The execution plan for DistributedQuery2 
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Query i: Query cost (relative to the batch): 100% 
SELECT * FROM dbo.Tabiei T1 INNER JOIN REMOTE.SCRATCH.dbo.Tablie2 T2 ON T1.ID = T2.. 
X ti 
a = Merge Join = Clustered Index Scan (Clustered) 
Eor EA (Inner Join} [Table1] -[PK_Tablel __3214EC277F60E.. 
— Cost: 1% Cost: 2 $% 
4 z 
SSS — -> 
~ Compute Scalar ~ Remote Query 
Cost: 0 $% Cost: 97 & 
Figure 4 
The execution plan after '001' was changed to 'l' in DistributedQuery2’s WHERE clause 
joined to Table2 on REMOTE, with a restriction = 
being placed on Tablel’s GUIDSTR column by the 
WHERE clause. Figure 3 presents the execution plan 
for this query. The plan first shows that SQL Server 
has chosen to seek forward (i.e., read the entries SA 
between two points in an index in order) through the 
index on GUIDSTR for all entries less than '001'. 
For each entry, SQL Server performs a lookup in the 
clustered index to acquire the data. This approach Frenn 
isn’t a surprising choice because the number of entries ` 
is expected to be low. 2 
Next, for each row in the intermediate result set, z 
SQL Server executes a parameterized query against = 
REMOTE, with the Tablel ID value being used to =o a 
acquire the record that has a matching ID value in 
Table2. SQL Server then appends the resultant values 
to the row. 
If you’ve had any experience with execution plan p 
analysis, SQL Server’s approach for Distributed- 
Query2 will look familiar. In fact, if you remove 
REMOTE from the query and run it again, the 
resulting execution plan is almost identical to the one i 
o 5,000 10,000 15,000 20,000 25,000 30,000 
in Figure 3. The only difference is that the Remote 
Number of Rows from Table1 
Query node (and the associated Compute Scalar 
node) is replaced by a seek into the clustered index == Remote Query =E= Local Query =b=Remote with Join Hint 
of Table2. In regard to efficiency (putting aside the 
issues of network roundtrips and housekeeping), this Figure 5 
is the same plan. Graphically comparing the results from three queries 
Now, let’s go back to the original distributed 
query in Listing 2. If you change '001' in the especially because the scan will be predominately 
WHERE clause to 'l' and rerun Distributed- sequential. 
Query2, you get the execution plan in Figure 4. In The remote data acquisition changed in a similar 
terms of the local data acquisition, the plan has way. Rather than performing multiple queries against 
changed somewhat, with a clustered index scan Table2, a single query was issued that acquires all the 
now being the preferred means of acquiring the rows from the table. This change occurred for the 
necessary rows from Tablel. SQL Server correctly same reason that the local data acquisition changed: 
figures that it’s more efficient to read every row efficiency. 
and bring through only those rows that match the Next, the record sets from the local and remote 
predicate GUIDSTR < 'I' than to use the index data acquisitions were joined with a merge join, as 
on GUIDSTR. In terms of I/O, a clustered scan Figure 4 shows. Using a merge join is much more 
is cheap compared to a large number of lookups, efficient than using a nested-loop join (as was done 
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in the execution plan in Figure 3) because two large 
sets of data are being joined. 

SQL Server exercised some intelligence here. 
Based on a number of heuristics, it decided that the 
scan approach was more advantageous than the seek 
approach. (Essentially, both the local and remote 
data acquisitions in the previous query are scans.) 
By definition, a heuristic is a “general formulation”; 
as such, it can quite easily lead to a wrong conclu- 
sion. The conclusion might not be downright wrong 
(although this does happen) but rather just wrong for 
a particular situation or hardware configuration. For 
example, a plan that performs well for two servers that 
sit side by side might perform horrendously for two 
servers sitting on different continents. Similarly, a plan 
that works well when executed on a production server 
might not be optimal when executed on a laptop. 

When looking at the execution plans in Figure 3 
and Figure 4, you might be wondering at what point 
does SQL Server decide that one approach is better 
than another. Figure 5 not only shows this pivotal 
moment but also succinctly illustrates the impreciseness 
of heuristics. To create this graph, I tested three queries 
on my test system, which consists of Windows XP Pro 
and SQL Server 2008 running on a 2.1GHz Core 2 
Duo machine with 3GB RAM and about 50MB per 
second disk I/O potential on the database drive. 

In Figure 5, the red line shows the performance 
of the first query I tested: DistributedQuery2 with 
REMOTE removed so that it’s a purely local query. 
As Figure 5 shows, the number of rows selected from 
Tablel by the WHERE clause peaks early on but it’s 


LISTING 3: DistributedQuery3 (Queries the 
Local Server and a View from a Remote Server) 


USE SCRATCH 


SELECT = 
FROM dbo.Tablel T1 
INNER JOIN REMOTE.Scratch.dbo.SimpleView T2 
ON T1.GUID 
= 7T2.GUID 
WHERE T1.ID < 5 


not until there are around 1,500 rows that SQL Server 
chooses an alternative approach. This isn’t ideal; if 
performance is an issue and you often find yourself 
on the wrong side of the peak, you can remedy the 
situation by adding the index hint 


WITH (INDEX = PRIMARY_KEY_INDEX_NAME) 


where PRIMARY_KEY_INDEX_NAME is the name 
of your primary key index. Adding this hint will result 
in a more predictable runtime, albeit at the potential 
cost of bringing otherwise infrequently referenced 
data into the cache. Whether such a hint is a help 
or hindrance will depend on your requirements. As 
with any T-SQL change, you should consider a hint’s 
impact before implementing it in production. 

Let’s now look at the blue line in Figure 5, which 
shows the performance of DistributedQuery2. For 
this query, the runtime increases steadily to around 20 
seconds then drops sharply back down to 3 seconds 
when the number of rows selected from Table! reaches 
12,000. This result shows that SQL Server insists on 
using the parameterized query approach (i.e., per- 
forming a parameterized query on each row returned 
from T1) well past the point that switching to the 
alternative full-scan approach (i.e., performing a full 
scan of the remote table) would pay off. SQL Server 
isn’t being malicious and intentionally wasting time. 
It’s just that in this particular situation, the param- 
eterized query approach is inappropriate if runtime is 
important. 

If you determine that runtime is important, this 
problem is easily remedied by changing the inner join 
to Table2 to an inner merge join. This change forces 
SQL Server to always use the full-scan approach. The 
results of applying this change are indicated by the 
purple line in Figure 5. As you can see, the runtime 
is much more predictable. 


DistributedQuery3 
Listing 3 shows DistributedQuery3, which differs from 
DistributedQuery2 several ways: 


Query 1: Query cost (relative to the batch): 


100% 
SELECT * FROM dbo.Tablel Ti INNER JOIN REMOTE.Scratch.dbo.SimpleView T2 ON T1.GULI.. 


E fe] y! 


SELECT =— Nested Loops Clustered Index Seek (Clustered) 
2 aoe (Inner Join) [Table1].[PK_Tablel_ 3214EC277F€0E... 
> Cost: 1% Cost: 0 $% 


E Ed 


~ Compute Scalar Remote Query 
Cost: 0 $% Cost: 99 $% 


Figure 6 


The execution plan for DistributedQuery3 
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Query 1: Query cost (relative to the batch): 100% 
SELECT * FROM dbo.Tablei T1 INNER JOIN REMOTE.Scratch.dbo.SimpleView T2 ON T1.GUL.. 
E g ; J 
= 2 Joi e—a : — oi 
SELECT rge EA Compute Scalar ~ ~ Remote Query 
Cost: 0 4% ane Joan Cost: 0 4% Cost: 35% 
Cost: 1% 
E i 
£ 
ae Clustered Index Seek (Clustered) 
[Table1] .[(PK_Tablel__3214EC277Fé0E... 
Cost: 0% —s O: 
Cost: 0 $% 
Figure 7 


The execution plan after 5 was changed to 10 in DistributedQuery3’s WHERE clause 


e Rather than joining directly to Table2, the join is to 
Simple View—a view that contains the results of a 
SELECT statement run against Table2. 

e The join is on the GUID column rather than on 
the ID column. 

e The result set is now being restricted by Tablel’s 
ID rather than GUIDSTR. 


When executed, DistributedQuery3 produces the 
execution plan in Figure 6. In this plan, four rows are 
efficiently extracted from Tablel, and a nested-loop 
join brings in the relevant rows from SimpleView via 
a parameterized query. 

Because the restriction imposed by the WHERE 
clause has been changed from GUIDSTR to ID, it’s 
a good idea to try different values. For example, the 
result of changing the 5 to a 10 in the WHERE clause 
is somewhat shocking. The original query returns 
16 rows in around 10ms, but the modified query 
returns 36 rows in about 4 seconds—that’s a 40,000 
percent increase in runtime for a little over twice the 
data. The execution plan resulting from this seemingly 
innocuous change is shown in Figure 7. As you can 
see, SQL Server has decided that it’s a good idea to 
switch to the full-scan approach. 

If you were to execute this query against Table2 
rather than SimpleView, SQL Server wouldn’t switch 
approaches until there were around 12,000 rows from 
Table! selected by the WHERE clause. So what’s 
going on here? The answer is that for queries involving 
remote tables, SQL Server will query the remote 
instance for various metadata (e.g., index informa- 
tion, column statistics) when deciding on a particular 
data-access approach to use. When a remote view 
appears in a query, the metadata isn’t readily available, 
so SQL Server resorts to a best guess. In this case, the 
data-access approach it selected is far from ideal. 

All is not lost, however. You can help SQL Server 
by providing a hint. In this case, if you change the 
inner join to an inner remote join, the query returns 
36 rows in only 20ms. The resultant execution plan 
reverts back to one that looks like that in Figure 6. 
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Again, I feel it is judicious to stress that using 
hints shouldn't be taken lightly. The results can vary. 
More important, such a change could be disastrous if 
the WHERE clause were to include many more rows 
from Table1, so always weigh the pros and cons before 


Currently, SQL Server does a 


great job optimizing local queries 
but often struggles to make the right 
choice when remote data is brought 


into the mix. 


using a hint. In this example, a better choice would 
be to rewrite the query to reference Table2 directly 
rather than going through the view, in which case you 
might not need to cajole SQL Server into performing 
better. There will be instances in which this isn’t pos- 
sible (e.g., querying a third-party server with a strict 
interface), so it’s useful to know what problems to 
watch for and how to get around them. 


Stay Alert 

I hope that I’ve opened your mind to some of the 
potential pitfalls you might encounter when writing 
distributed queries and some of the tools you can use 
to improve their performance. Sometimes you might 
find that achieving acceptable performance from a 
distributed query isn’t possible, no matter how you 
spin it, in which case alternatives such as replication 
and log shipping should be considered. 

It became apparent during my research for this 
article that SQL Server does a great job optimizing local 
queries but often struggles to make the right choice 
when remote data is brought into the mix. This will 
no doubt improve given time, but for the moment you 
need to stay alert, watch for ill-performing distributed 
queries, and tweak them as necessary. SQL} 
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Altova DatabaseSpy 2010 


Database tools for developers and power users 


Itova’s DatabaseSpy 2010 is a curious blend of 

tools, capabilities, and features that target devel- 
opers and power users. Curious because the tools come at 
a great price—just $189 per license, and target all major 
databases on the market today—yet DatabaseSpy lacks 
any of the tools DBAs need to manage servers, data- 
bases, security, and other system-level requirements. 

In other words, DatabaseSpy is an inexpensive, 
powerful, multiplatform, database-level management 
solution. And that, in turn, means that evaluating its 
value and benefit must be contrasted against both dif- 
ferent audiences (DBAs versus developers and power 
users) and against different types of environments 
(heterogeneous or SQL Server—only). 


For Heterogeneous 
Environments 

For DBAs, developers, and power users operating in 
a heterogeneous environment, DatabaseSpy is worth 
a look. At the database level, it offers a great set of 
features, including the ability to script objects, query 
(and export) data, and create objects, such as tables, 
views, and sprocs and queries via designers or script. It 
also provides some unexpected features in the form of 
visual schema and data comparison tools. 

The big win that DatabaseSpy provides is a homog- 
enized interface and toolset that can connect to all major 
modern databases. For developers and power users, 
this offers a great way to tame some of the complexity 
that they might need to deal with in a heterogeneous 


ALTOVA DATABASESPY 2010 


Pros: Provides a great set of standardized features for interacting 
with a wide range of database platforms at the database level; offers great 
tools and options for developers and business users, especially in heteroge- 


neous environments 


Cons: No support for core DBA tasks such as server and database configura- 

tion and file or security management; UI feels cluttered at times (possibly just 
the default font size), and there’s a slight learning curve involved; XmlSpy Pro 
($499) is required for some advanced XML modification and interactions 


Rating: KAKKO 


Price: $189 per license 
Contact: Altova e 978-816-1606 © www.altova.com 


Recommendation: DatabaseSpy 2010 offers a compelling set of capabilities 
for developers and database power users at a great price. DBAs in heterogeneous 
environments will be able to take advantage of a homogenized set of tools at 
the database level, but they can’t use DatabaseSpy for common DBA tasks 
and management. 
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environment. Moreover, DatabaseSpy’s schema and data 
comparison and synchronization across multiple plat- 
forms is something that even DBAs can benefit from. 


For SQL Server-Only 
Environments 

For organizations that predominantly use SQL Server, 
DatabaseSpy’s benefits will largely depend on audience 
or job description. For example, I’ve never been fond of 
Visual Studio’s support for SQL Server development, and 
I think that many developers using SQL Server would 
prefer DatabaseSpy’s UI and features over the paltry 
tools provided by Visual Studio. And, at $189 per license, 
the additional tools (such as change script management, 
schema and data comparison, and code formatting and 
editing capabilities) are a great buy. Likewise, although 
there will be some learning curve involved, business users 
might also prefer using DatabaseSpy in many cases. 

But for DBAs who manage only SQL Server 
deployments, DatabaseSpy will fall flat. In addition 
to not providing any server or database management 
tools or capabilities, DatabaseSpy’s tree view shows 
only one database at a time and doesn’t provide any 
tools for managing anything other than tables, views, 
sprocs, user-defined functions, and XML schemas. 

In terms of managing XML schemas, I was hoping 
that DatabaseSpy would be a great way for SQL 
Server DBAs to pick up some cheap tools to allow 
more powerful editing capabilities and options. But 
although it provides some XML editing capabilities, 
schema management is handled by XmlSpy Pro, which 
costs an additional $499. Moreover, DatabaseSpy 
doesn’t extend any of the functionality provided by 
XmlSpy, which offers native support for connecting to 
and managing XML schemas defined in SQL Server. 
Consequently, not even DatabaseSpy’s XML features 
provide much benefit to DBAs, leaving DatabaseSpy 
pretty much a non-starter for SQL Server DBAs. 

DatabaseSpy isn’t for most DBAs. But its small 
price tag and great feature set make it a compelling con- 
sideration for developers and power users. Moreover, 
because it’s so heavily database centered, DBAs can use 
DatabaseSpy to help grant developers and power users 
access to a database without as much worry that people 
will poke around where they shouldn’t (although that’s 
no substitute for proper security). Consequently, even 
though DatabaseSpy is missing DBA-specific tools, 
it’s well worth a test drive. For what it offers, it’s well 
executed and feature-rich. SQL 
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SQL Server 2005 SP4: 


sow PE tonr What You Need to Know 


Your guide to sponsored resources 


Learn the Essential 

Tips to Optimizing SSIS 
Performance 

SSIS is a powerful and flexible data integration 
platform. With any dev platform, power and 
flexibility = complexity. There are two ways to 
manage complexity: mask/manage the complexity 
or gain expertise. Join expert Andy Leonard for 
this on-demand web seminar as he discusses 
workarounds for three areas of SSIS complexity: 
heterogeneous data sources, performance tuning, 
and complex transformations. 
sqlmag.com/go/SSIS 


Protecting SQL Server Data 
How protected is your mission-critical data? 

Are you ready for any kind of outage? If not, 
what will the cost be to your business, and can 
your business (and your resume) afford it? Join 
SQL Server expert, Michael K. Campbell for an 
independent overview of key considerations for 
protecting your SQL Server data. 
sqlmag.com/go/seminar/ProtectSQLData 


SQL Server Consolidation, 
eLearning Series 
Join SQL Server experts Allan Hirt and Ben DeBow 
for 6 in-depth sessions and: 
Understand the benefits behind SQL Server 
consolidation 
Learn how to plan a SQL Server consolidation 
effort 
Understand how administering a consolidated 
SQL Server environment changes your current 
DBA-related tasks 
Register today: sqlmag.com/go/ 
SOLServerConsolidationeLearning 
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Q: a Microsoft Connect post (tinyurl.com/ykteu3z), one 
user asked for a headcount concerning who would prefer 
Microsoft release an SP4 for SQL Server 2005. The vote? 1085 
yea, 1 nay. Over 100 comments echo the same sentiments: “I 
really hope that Microsoft will release a SP4 for SQL Server 
2005 soon.” 

Fortunately, it appears that Microsoft has heard the cries of 
DBAs and is planning to release SQL Server 2005 SP4. Here’s 
what you need to know: 

1. When will SP4 be released? The current target is Q4 2010. 

2. What will change in SP4? There aren’t any specifics 
on this, but I doubt we'll see much other than bug fixes 
and cleanup. I wouldn’t expect any significant functionality 
improvements, considering that Microsoft is pushing SQL 
Server 2008 R2 this year (release date in May 2010) as well. 

3. With the expectation of 


Brian Reinholz 


(breinholz@ windowsitpro.com) is production 
editor for Windows IT Pro and SQL Server 
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certification. 


2005 SP4, should I skip 2008? 
That depends on your com- 
pany. Many companies haven’t 
upgraded to SQL Server 2008 
because of (a) compatibility 
issues with existing programs or 
(b) cost and hassle. And if your 
company just recently moved 
to SQL Server 2005, you're 
probably not ready for yet 
another upgrade. 


R2 will be. 


Every company is different—I recommend reading some of the 

recent articles we’ve posted on new features to see what benefits 

you'd gain from the upgrade so you can make an informed 

decision: 

e “Microsoft’s Mark Souza Lists His Favorite SQL Server 
2008 R2 Features” (InstantDoc ID 103648) 

e “T-SQL Enhancements in SQL Server 2008” (InstantDoc 
ID 99416) 

e “SQL Server 2008 R2 Editions” (InstantDoc ID 103138) 


One of the most anticipated features (according to SQL 
Server Magazine readers) of SQL Server 2008 R2 is Power- 
Pivot. Check out Derek Comingore’s blog at InstantDoc ID 
103524 for more information on this business intelligence (BI) 
feature. 

One last thing to consider when mulling over an upgrade is 
this: We don’t know when the next SQL Server version after 2008 
R2 will be. Microsoft put a 5-year gap between 2000 and 2005, 
but only 3 years between 2005 and 2008. They’ve been trying to 
get on a three-year cycle between major releases. So will the next 
version be 2011? We really don’t know yet, but it could be a long 
wait for organizations still on SQL Server 2005. SQL 
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Left-Brain.com is the newly launched online superstore stocked with 
educational, training, and career-development materials focused on 
meeting the needs of SQL Server professionals like you. 


> SOL Server 08 System Views Featured Product: 
= SQL Server 2008 System Views Poster 


Face the migration learning curve head on with the SQL Server 
2008 System Views poster. An updated full-color print diagram of 
Catalog views, dynamic management views, tables, and objects for 
SOL Server 2008 (including relationship types and object scope), 
this poster is a must-have for every SQL DBA migrating to or al- 
ready working with SQL Server 2008. 


Order your full-size, print copy 
today for only $14.95*! 


*Plus shipping and applicable tax. 


brar 
www.left-brain.com Kn iServer 


DATA MIGRATION 

Import and Export Database Content 
DB Software Laboratory has released Data Exchange Wizard ActiveX 3.6.3, the latest version of the 
company’s data migration software. According to the vendor, the Data Exchange Wizard lets devel- 
opers load data from any database into another, and is compatible with Microsoft Excel, Access, DBF 
and Text files, Oracle, SQL Server, Interbase/Firebird, MySQL, PostgreSQL, or any ODBC-compliant 
database. Some of the notable features include: a new wizard-driven interface; the ability to extract 
and transform data from one database to another; a comprehensive error log; cross tables loading; the 
ability to make calculations while loading; SQL scripts support; and more. The product costs $200 for 
a single user license. To learn more or download a free trial, visit www.dbsoftlab.com. 


D) Editor’s Tip 


Got a great 
new product? 
Send announce- 
ments to products @ 
sqlmag.com. 
— Brian Reinholz, 
production editor 


DATABASE ADMINISTRATION 
Ignite 8 Brings Comprehensive Database 
Performance Analysis 

Confio Software has released Ignite 8, a major 
expansion to its database performance solution. The 


ignite 
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most significant enhancement to the product is the 
addition of monitoring capabilities, such as tracking 3 
server health and monitoring current session activity. gu 
These features combine with Ignite’s response-time 
analysis for a comprehensive monitoring solution. 
New features include: historical trend patterns; 
reporting to show correlation of response-time with 
health and session data; an alarm trail to lead DBAs 
from a high-level view to pinpoint a root cause; and 
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more. Ignite 8 costs $1800 per monitored instance. To 
learn more about Ignite 8 or download a free trial, visit www.confio.com. 


PERFORMANCE AND MONITORING 

SafePeak Caching Solution Improves Performance 

DCF Technologies has announced SafePeak 1.3 for SQL Server. According to the vendor, 
SafePeak’s plug-and-play solution dramatically accelerates data access and retrieval, providing 
immediate results with no need to change existing databases or applications. SafePeak acts as a proxy 
cache between application servers and database servers. Auto-learning caching algorithms store SQL 
statement results in SafePeak memory, greatly reducing database server load. SafePeak lets you do 
the following: configure and add cache policy rules; manage rules such as query pattern definition, 
default timeout, eviction scheduling, and object dependencies for eviction; report on performance 


and data usage statistics; and configure and add multiple SQL [Ei 


i isi \entbhO8 = DR = yan = Dok = Verin Crid = tso = 
Server instances. To learn more, visit www.safepeak.com. CITEZ ISANE 
= JSION-D32 (Nome - = 
DATA MIGRATION E Heal 
Migration Programs Now Compatible with SQL Server bel Bo 
Nob Hill Software has announced support for SQL Server for two Some baa 
products: Nob Hill Database Compare and Columbo. Database oe 


Compare is a product that looks at two different databases and can 
migrate the schema and data to make one database like the other. 
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Columbo compares any form of tabular data (databases, Microsoft a. ae 
Office files, text, or XML files). Like Database Compare, Columbo CETE 


can also generate migration scripts to make one table like another. 
Database Compare costs $299.95 and Columbo costs $99.95, 
and free trials are available for both. To learn more, visit www 
.nobhillsoft.com. SOL 
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Fam 6 Essential FAQs about 


Microsoft Silverlight 


ilverlight is one of the new web development 

technologies that Microsoft has been promoting. 
As a database professional, you might be wondering 
what Silverlight is and what it offers to those in the 
SQL Server world. I’ve compiled six FAQs about 
Silverlight that give you the essentials on this new 
technology. 


What is Silverlight and is it 
free? 
Microsoft Silverlight is essentially a technology for 
building Rich Internet Applications (RIA). Silverlight 
applications are based on the .NET Framework and 
are an alternative to using Adobe Flash. Silverlight 
provides the ability to deliver streaming media and 
interactive applications for the web and is delivered 
as a browser plug-in. 

Silverlight 3 is Microsoft’s most recent release, but 
Silverlight 4 is currently in beta and is expected to be 
available in mid-2010. Yes, it’s free. 


What browsers does it 
support? 
Silverlight 3 supports Internet Explorer, Firefox, and 
Safari. Silverlight 4 will add Google Chrome to this list. 
Both Windows and Mac have a Silverlight runtime. 
At this time there’s no Silverlight support for Linux. 
However, Mono is developing an open-source imple- 
mentation of Silverlight named Moonlight. (Mono 
has also developed an open-source implementation of 
the .NET Framework.) You can find out more about 
Moonlight at www.mono-project.com/moonlight. 


What can database developers 
do with it? 

Earlier versions of Silverlight were targeted more 
toward rich media delivery from the web. However, 
Silverlight 3 includes support for data binding with 
SQL Server, making Silverlight a viable choice for 
building RIA business applications on the web. 
Silverlight 4 will provide drag-and-drop support for 
data binding with several built-in controls, including 
the listbox and datagrid controls. 


You can learn more about Silverlight data 
binding at msdn.microsoft.com/en-us/library/ 
cc278072(VS.95).aspx. In addition, Visual Studio 
(VS) 2010 will include a new Business Application 
Template to help you build line-of-business Silver- 
light applications. 


Can it access resources on the 
client system? 

Yes. Silverlight supports building both sandboxed 
and trusted applications. Sandboxed applications are 
restricted to the browser. Trusted Silverlight applica- 
tions are able to access client resources. 

Silverlight 4 will extend Silverlight’s reach well 
into the client desktop and will include the ability 
to print as well as read and write to the My 
Documents, My Pictures, My Music and My Videos 
folders. Silverlight 4 also can access installed programs 
such as Microsoft Excel and interact with system 
hardware such as the camera, microphone, and 
keyboard. 


How can I tell if I have Silverlight 
installed and how do I know 
what version Pm running? 

The best way to find out if Silverlight is installed 
is to point your browser to www.microsoft.com/ 
getsilverlight/Get-Started/Install/Default.aspx. 

When the Get Microsoft Silverlight webpage 
loads it will display the version of Silverlight that’s 
installed on your client system. For instance, on my 
system the webpage showed I was running Silverlight 
3 (3.0.50106.0). 


How do I start developing 
Silverlight applications? 
Silverlight applications are coded in XAML and 
created using VS 2008/2010, Microsoft Expression 
Studio, or the free Microsoft Visual Web Developer. 
You can download the Silverlight runtime, Silverlight 
development tools, and guides at silverlight.net/ 
getstarted. SQL 
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SOL Sentry products are made by DBAs, for DBAs. They provide unparalleled insight, awareness and control over your 
SOL Server environment. With features like real-time and historical performance analysis and visual schedule 
management, SOL Sentry is the ultimate visibility solution. 


IDERA SQL virtual database" USER: 


SQL virtual database” 7} . 
Access data in SQL Server backup files without restoring | c) = e) 


